Nvidia Turing Speculation thread [2018]

Take the INT + FP thing with a large grain of salt until we have more details. NVIDIA has never done a very good job of communicating how that works on Volta; I don't expect Turing to immediately be better.
 
RT core specs: 10 Gray/s, ray triangle intersection, BVH traversal

Are these coherent primary rays or incoherent secondary rays?
How can you specify the number of rays per second for ray tracing, when this depends on scene complexity AFAIK, like the number of triangles?
Since marketing folks like big numbers, it's likely the number of ray-triangle intersection tests that can be completed per second.
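For scale, here is roughly what one such test costs in plain arithmetic: a minimal CPU-side sketch of the standard Möller–Trumbore ray-triangle test (generic textbook code, not whatever the RT cores actually implement in hardware).

```cpp
#include <array>
#include <cmath>
#include <optional>

using Vec3 = std::array<float, 3>;

static Vec3 sub(const Vec3& a, const Vec3& b) { return {a[0] - b[0], a[1] - b[1], a[2] - b[2]}; }
static float dot(const Vec3& a, const Vec3& b) { return a[0]*b[0] + a[1]*b[1] + a[2]*b[2]; }
static Vec3 cross(const Vec3& a, const Vec3& b) {
    return {a[1]*b[2] - a[2]*b[1], a[2]*b[0] - a[0]*b[2], a[0]*b[1] - a[1]*b[0]};
}

// Möller–Trumbore: distance t along the ray to triangle (v0, v1, v2), or
// nothing on a miss. Two cross products, four dots, one divide: per test, per ray.
std::optional<float> intersect(const Vec3& orig, const Vec3& dir,
                               const Vec3& v0, const Vec3& v1, const Vec3& v2) {
    const Vec3 e1 = sub(v1, v0), e2 = sub(v2, v0);
    const Vec3 p = cross(dir, e2);
    const float det = dot(e1, p);
    if (std::fabs(det) < 1e-8f) return std::nullopt;  // ray parallel to triangle
    const float inv = 1.0f / det;
    const Vec3 tv = sub(orig, v0);
    const float u = dot(tv, p) * inv;                 // first barycentric coordinate
    if (u < 0.0f || u > 1.0f) return std::nullopt;
    const Vec3 q = cross(tv, e1);
    const float v = dot(dir, q) * inv;                // second barycentric coordinate
    if (v < 0.0f || u + v > 1.0f) return std::nullopt;
    const float t = dot(e2, q) * inv;
    if (t < 0.0f) return std::nullopt;                // hit is behind the origin
    return t;
}
```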
 

What info are we missing? There are 16 FP units and 16 INT units per partition, and the dispatcher can issue a 32-wide warp per clock. That means it can alternate issue between the FP and INT units every other clock and keep both fully occupied.
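For what it's worth, the kind of code that benefits is any inner loop that mixes integer address math with FP arithmetic. A contrived sketch (my example, nothing NVIDIA has published):

```cpp
#include <cstdio>
#include <vector>

// Contrived shader-like inner loop: the hash/offset math is pure INT work,
// the shading math is pure FP work. A Turing SM partition can alternate
// issue between its 16 INT and 16 FP units and keep both busy; on Pascal
// the same integer instructions would occupy the FP32 units instead.
float shade(const float* lut, int pixel, int frame) {
    int h = (pixel * 9781 + frame * 6271) & 1023;  // INT pipe: table offset
    return lut[h] * 0.5f + 0.25f;                  // FP pipe: multiply-add
}

int main() {
    std::vector<float> lut(1024, 1.0f);
    float sum = 0.0f;
    for (int p = 0; p < (1 << 20); ++p) sum += shade(lut.data(), p, 42);
    std::printf("%f\n", sum);
}
```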
 
That would be a rather low number then. Rays/second normally includes the full BVH traversal until the intersection with the right triangle is found.
Here is a recent NVIDIA ray tracing paper I found: http://research.nvidia.com/publication/2017-07_Efficient-Incoherent-Ray
In Table 4, using a Titan Xp, they report from 1.987 Gray/s (Sibenik) down to 322 Mray/s (Powerplant).
Sibenik is 80 K triangles, Powerplant is 12.8 M triangles (Table 6).
So Gray/s without context doesn't make much sense.
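To make the scene dependence concrete: the per-ray cost is really "BVH nodes visited", which grows with scene size and structure. A toy traversal counter, with a made-up node layout (real GPU BVHs are wider and compressed):

```cpp
#include <algorithm>
#include <vector>

// Made-up node layout; real GPU BVHs are wider and compressed.
struct Node {
    float bmin[3], bmax[3];  // axis-aligned bounding box
    int left, right;         // child indices, -1 when this is a leaf
};

// Standard slab test: does the ray (origin o, inverse direction invD) hit the box?
bool hitAABB(const Node& n, const float o[3], const float invD[3]) {
    float tmin = 0.0f, tmax = 1e30f;
    for (int a = 0; a < 3; ++a) {
        float t0 = (n.bmin[a] - o[a]) * invD[a];
        float t1 = (n.bmax[a] - o[a]) * invD[a];
        if (t0 > t1) std::swap(t0, t1);
        tmin = std::max(tmin, t0);
        tmax = std::min(tmax, t1);
    }
    return tmin <= tmax;
}

// How many nodes one ray touches depends on tree depth and how well the
// boxes prune, which is why the same GPU does 1.987 Gray/s on Sibenik
// (80 K tris) but only 322 Mray/s on Powerplant (12.8 M tris).
int nodesVisited(const std::vector<Node>& bvh, const float o[3], const float invD[3]) {
    int visited = 0;
    std::vector<int> stack{0};  // start at the root
    while (!stack.empty()) {
        int i = stack.back();
        stack.pop_back();
        ++visited;
        if (!hitAABB(bvh[i], o, invD)) continue;  // miss: prune whole subtree
        if (bvh[i].left < 0) continue;            // leaf: triangle tests go here
        stack.push_back(bvh[i].left);
        stack.push_back(bvh[i].right);
    }
    return visited;
}
```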
 
Seems the real-time ray-traced demos (Star Wars, Dancing Robots) that used Unreal on a DGX workstation (4 Volta GPUs) can now be seen running on a single Turing Quadro. The recent Porsche real-time ray tracing demo requires 2 RTX Quadros.
  • Photorealistic, interactive car rendering — Spoiler: this demo of a Porsche prototype looks real, but is actually rendered. To prove it, you'll be able to adjust the lighting and move the car around. It's all built in Unreal Engine, with the Microsoft DXR API used to access the NVIDIA RTX dev platform. It runs on two Quadro RTX GPUs.
  • Real-time ray tracing on a single GPU — This Star Wars-themed demo stunned when it made its debut earlier this year running on a $70,000 DGX Station powered by four Volta GPUs. Now you can see the same interactive, real-time ray tracing using Unreal Engine running on our NVIDIA RTX developer platform on a single Turing Quadro GPU.
  • Advanced rendering for games & film (dancing robots) — This one is built on Unreal as well, and shows how real-time ray tracing can bring complex, action-packed scenes to life. Powered by a single Quadro RTX 6000, it shows real-time ray-traced effects such as global illumination, shadows, ambient occlusion, and reflections.
https://blogs.nvidia.com/blog/2018/08/14/turing-demos-real-time-ray-tracing/
 
Apparently some companies at SIGGRAPH are already showing RTX demos:
“NVIDIA RTX ray-tracing hardware is the future – and will define the next decades of GPU rendering. At SIGGRAPH, we’re demonstrating performance improvements of 5-8x with Octane 2019’s path-tracing kernel – running at 3.2 billion rays/second on NVIDIA’s new Quadro RTX 6000 – compared to 400 million rays/second on P6000” — Jules Urbach, chief executive officer, Otoy
https://blogs.nvidia.com/blog/2018/08/13/turing-industry-support/
 
Pica Pica & the Star Wars RT demo didn't require 4x Titan Vs though.
The use of a DGX was simply marketing from Nvidia, as the product was on sale (50% off during GDC).
 
It's just a theoretical maximum like FLOPs. The only way it might be achieved is colliding against a single screen space triangle with zero bounces. I doubt the traversal is even included as that will always vary based on the model.
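Even taken at face value, some quick arithmetic shows what the headline figure would buy (my numbers, not NVIDIA's):

```cpp
#include <cstdio>

int main() {
    // Back-of-envelope: what would 10 Gray/s buy if it were reachable on
    // real scenes?
    const double raysPerSec = 10e9;
    const double primaries = 1920.0 * 1080.0 * 60.0;  // 1080p at 60 fps: ~124 Mray/s
    std::printf("rays per pixel per frame: %.1f\n", raysPerSec / primaries);
    // Prints ~80.4. The Titan Xp paper numbers above (0.32-1.99 Gray/s
    // depending on scene) suggest achievable budgets are far smaller.
}
```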
 
If the RT cores are a thing, it's not surprising that Volta lacks the dedicated hardware.

The only strange thing for me is the two architectures for different markets. Before, there was one big core, and then you cut down some features.

Or the future is Turing everywhere, and Volta was really short-lived.
 
Quoting ray tracing speed for a single triangle would be a joke, right?
When rendering a single triangle, ray tracing becomes essentially equivalent to rasterizing, which GPUs can do at hundreds of Gpix/s.
Ray tracing is all about efficient acceleration structures, which are needed when rendering scenes with many triangles.
I'm starting to doubt now whether those RT cores are actual hardware and not just GPU code making clever use of the abundant INT cores and improved L1 cache.
BTW, what about ray tracing participating media like smoke and fire? These are represented by volumetric textures, so no BVH or triangle intersections are needed, and hence no speedup from RT cores?
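On that point: participating media are usually rendered by marching the ray through a density grid and accumulating absorption, so on the face of it there is nothing for a triangle/BVH unit to do. A minimal sketch of that inner loop (illustrative only, names are mine):

```cpp
#include <cmath>
#include <functional>

// Stand-in for sampling a 3D density texture of smoke or fire.
using DensityFn = std::function<float(float x, float y, float z)>;

// March the ray in uniform steps and accumulate Beer-Lambert absorption:
// just texture sampling and FP math, no BVH and no triangle tests, which
// is why it's not obvious the RT cores would help on this path.
float marchOpacity(const DensityFn& density, const float o[3], const float d[3],
                   float tMax, float step, float sigma) {
    float transmittance = 1.0f;
    for (float t = 0.0f; t < tMax; t += step) {
        float rho = density(o[0] + t * d[0], o[1] + t * d[1], o[2] + t * d[2]);
        transmittance *= std::exp(-sigma * rho * step);  // absorption this step
    }
    return 1.0f - transmittance;  // opacity seen along the ray
}
```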
 
As a ray tracing tech demo this must be fast or impressive; as a demo for nice game visuals it would be less impressive.
It would be quite impressive to see a seaside full of small stones or even sand particles in games. My guess is that as long as these are static and instanced across the scene, it would be quite cheap for RT: no BVH updates frame to frame, just shoot N million parallel primary rays to draw the same instanced models across the scene (which should also be quite cache-friendly).
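Conceptually that's a two-level structure, something like the sketch below (my struct names; in DXR terms these would be the bottom- and top-level acceleration structures):

```cpp
#include <cstdio>
#include <vector>

// One bottom-level BVH per unique mesh, built once for a static pebble.
struct MeshBVH { std::vector<float> nodes; };

// Instances just point at the shared BVH; the transform is the only
// per-copy data, so nothing below the top level changes frame to frame.
struct Instance {
    const MeshBVH* mesh;
    float xform[12];  // 3x4 object-to-world matrix
};

int main() {
    MeshBVH pebble;  // single BVH build for the stone mesh
    std::vector<Instance> beach(1'000'000, Instance{&pebble, {}});
    std::printf("per-instance overhead: %zu bytes\n", sizeof(Instance));
    // A million static pebbles cost one BVH plus ~56 bytes per instance.
}
```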
 
NVIDIA Breaks New Ground with Turing GPU Architecture
In a bid to reinvent computer graphics and visualization, NVIDIA has developed a new architecture that merges AI, ray tracing, rasterization, and computation. The new architecture, known as Turing, was unveiled this week by NVIDIA CEO Jensen Huang in his keynote address at SIGGRAPH 2018.

A key element in the Turing architecture is the RT Cores, a specialized bit of circuitry that enables real-time ray tracing for accurate shadowing, reflections, refractions, and global illumination. Ray tracing essentially simulates light, which sounds simple enough, but it turns out to be very computationally intense. As the above product chart shows, the new Quadros can simulate up to 10 billion rays per second, which would be impossible with a more generic GPU design.

The on-board memory is based on GDDR6, which is something of a departure from the Quadro GV100, which incorporated 32GB of HBM2 memory. Memory capacity on the new RTX processors can be effectively doubled by hooking two GPUs together via NVLink, making it possible to hold larger images in local memory.

As usual, the SM will supply compute and graphics rasterization, but with a few twists. With Turing, NVIDIA has separated the floating point and integer pipelines so that they can operate simultaneously, a feature that is also available in the Volta V100. This enables the GPU to do address calculations and numerical calculations at the same time, which can be a big time saver. As a result, the new Quadro chips can deliver up to 16 teraflops and 16 teraops of floating point and integer operations, respectively, in parallel. The SM also comes with a unified cache with double the bandwidth of the previous generation architecture.

Perhaps the most interesting aspect of the new Quadro processors is the Turing Tensor Cores. For graphics and visualization work, the Tensor Cores can be used for things like AI-based denoising, deep learning anti-aliasing (DLAA), frame interpolation, and resolution scaling. These techniques can be used to reduce render time, increase image resolution, or create special effects.

The Turing Tensor Cores are similar to those in the Volta-based V100 GPU, but in this updated version NVIDIA has significantly boosted tensor calculations for INT8 (8-bit integer), which are commonly used for inferencing neural networks. In the V100, INT8 performance topped out at 62.8 teraops, but in the Quadro RTX chips, this has been boosted to a whopping 250 teraops. The new Tensor Cores also provide an INT4 (4-bit integer) capability for certain types of inferencing work that can get by with even less precision. That doubles the tensor performance to 500 teraops – half a petaop. The new Tensor Cores also provide 125 teraflops for FP16 data – same as the V100 – if for some reason you decide to use the Quadro chips for neural net training.

https://www.top500.org/news/nvidia-breaks-new-ground-with-turing-gpu-architecture/
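The INT8 emphasis makes sense because inference usually survives aggressive quantization. A minimal sketch of symmetric INT8 weight quantization, just to show why 8 bits is generally enough (illustrative only; production stacks such as TensorRT calibrate scales per layer or per channel):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    // Toy weights; a real layer would have thousands.
    std::vector<float> w = {0.731f, -0.042f, 0.318f, -0.955f};

    float maxAbs = 0.0f;
    for (float x : w) maxAbs = std::max(maxAbs, std::fabs(x));
    const float scale = maxAbs / 127.0f;  // map [-maxAbs, maxAbs] to [-127, 127]

    for (float x : w) {
        auto q = static_cast<int8_t>(std::lround(x / scale));   // quantize
        std::printf("%+.3f -> %4d -> %+.3f\n", x, q, q * scale); // round trip
    }
}
```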
 