Nvidia Turing Speculation thread [2018]

Discussion in 'Architecture and Products' started by Voxilla, Apr 22, 2018.

Tags:
Thread Status:
Not open for further replies.
  1. ShaidarHaran

    ShaidarHaran hardware monkey Veteran

    My hope for NVlink is that NV will offer an operational mode that pools VRAM without attempting to split the workload. Essentially, one GPU would perform all the calculations of a given workload with the other sitting idle or perhaps only assisting trivially. As I said, I don't need more performance necessarily (though I won't say no to it of course), I need about the level of performance which is already available via say GP102 but with more VRAM.
     
  2. silent_guy

    silent_guy Veteran Subscriber

    The reason you need BW between GPUs to work on the same frame is to give GPU B access to data that was generated by GPU A and vice versa.

    You don’t need it for other data, such as non-rendered textures, Z buffer etc.

    It’s clear that these shared resources are only a fraction of the total memory BW. I don’t know how much, but it seems very unlikely that they are 50%. So that number seems very wrong to me.

Are they less than 8% (50 GB/s / ~600 GB/s)? Probably not. But even if they are more than that, you could still have meaningful performance increases.

    It’s also very game dependent, of course.

That said: if NVLink and tensor cores work in the GeForce GPUs the way they work on Tesla-class GPUs, then deep learning people will try to gobble them all up anyway, so it’s all academic. :)
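The fraction argument above can be checked with quick arithmetic (using the 50 GB/s link figure and ~600 GB/s memory bandwidth from the post):

```python
# Rough check of the shared-resource bandwidth fraction from the post:
# a 50 GB/s inter-GPU link against ~600 GB/s of local memory bandwidth.
nvlink_bw_gbps = 50      # GB/s, inter-GPU link figure from the post
memory_bw_gbps = 600     # GB/s, approximate local memory bandwidth

fraction = nvlink_bw_gbps / memory_bw_gbps
print(f"NVLink covers about {fraction:.1%} of local memory bandwidth")
# If the shared intermediate data is under that fraction of total traffic,
# the link is not the bottleneck; well above it, scaling suffers.
```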
     
  3. silent_guy

    silent_guy Veteran Subscriber

    For that kind of use case, the NVLink BW is almost certainly way too low. So don’t count on it.
     
  4. ShaidarHaran

    ShaidarHaran hardware monkey Veteran

    Not necessarily. My particular use case requires a large texture cache for textures that would not be loaded/unloaded frequently.
     
  5. Communism

    Communism Newcomer

    That would be true if you are doing old school rendering.

    If you are doing deferred rendering, you have to synchronize all the intermediate steps, which means an absolute crapton of bandwidth needed.
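A rough illustration of why the sync cost adds up (the resolution, target count, and formats below are illustrative assumptions, not numbers from the post):

```python
# Illustrative estimate of per-frame G-buffer traffic in a deferred renderer.
# All numbers here are assumptions for the sketch, not measured values.
width, height = 3840, 2160          # 4K resolution
render_targets = 4                  # e.g. albedo, normals, material, depth
bytes_per_pixel = 8                 # assume 16-bit RGBA per target
fps = 60

bytes_per_frame = width * height * render_targets * bytes_per_pixel
gb_per_second = bytes_per_frame * fps / 1e9
print(f"~{gb_per_second:.1f} GB/s just to mirror the G-buffer each frame")
```

That alone would eat a large chunk of a 50 GB/s link, before any other shared data.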

I'm sure there aren't enough deep learning students to make an appreciable dent in RTX 2080 Ti supplies anyway, and serious deep learning people will need/want the extra VRAM the Quadros have.

But yeah, there's every reason to think Nvidia would lock down/artificially slow down anything to do with deep learning / HPC outside of pointless epeen benchmarks and directly gaming-related stuff in their GeForce drivers.
     
  6. CSI PC

    CSI PC Veteran

Remember though, NVLink2 has a more cohesive cache design that brings the two cards closer to behaving as one GPU.
So it is a bit limited at 100 GB/s (both links can be used to feed one GPU), but it gains slightly better latency than SLI and, importantly, better communication/integration between the GPUs for developers/engines. It also offers better BW than SLI or even PCIe x16 Gen4 when both Nvidia bricks are used in a dual-GPU setup. NVLink is designed to be flexible, so it would be fine, but yes, it still has limitations.
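For scale, the 100 GB/s figure can be set against PCIe 4.0 x16 (computed from the published 16 GT/s per-lane rate and 128b/130b encoding):

```python
# Comparing the link bandwidths mentioned: dual-brick NVLink vs PCIe 4.0 x16.
# PCIe 4.0: 16 GT/s per lane, 128b/130b encoding, 16 lanes, per direction.
pcie4_x16_gbps = 16e9 * (128 / 130) * 16 / 8 / 1e9   # ~31.5 GB/s
nvlink_gbps = 100                                     # both bricks to one GPU, per the post

print(f"PCIe 4.0 x16: ~{pcie4_x16_gbps:.1f} GB/s; NVLink: {nvlink_gbps} GB/s")
```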

    Edit:
I would need to check, but I thought the real-world application comparison between NVLink and SLI showed about 25% gains; big caveat: that was not gaming, and I'm not sure it was the latest NVLink iteration.
     
    Last edited: Aug 18, 2018
  7. Communism

    Communism Newcomer

I'm not sure you understand the gigantic gulf in bandwidth requirements between 2-way AFR with 2+ frames of latency and working on a single frame with no artificially added latency.

One of these requires little more than enough bandwidth to send over the finished frame in time, while the other requires both GPUs to synchronize every single intermediate render state.
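The finished-frame side of that comparison is easy to put a number on (resolution and frame rate are illustrative assumptions):

```python
# Contrast case: with AFR, the link mostly needs to carry the finished frame.
# Assumed numbers for illustration only.
width, height = 3840, 2160   # 4K output
bytes_per_pixel = 4          # 8-bit RGBA final frame
fps = 60

finished_frame_gbps = width * height * bytes_per_pixel * fps / 1e9
print(f"~{finished_frame_gbps:.1f} GB/s to ship finished 4K/60 frames")
# versus many times that to mirror every intermediate render target,
# which is the gulf being pointed at.
```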
     
  8. Rootax

    Rootax Veteran


Of course I don't know your precise need, but, in that particular case, why not a Vega FE with 16 GB of RAM or even a Vega Pro SSG?
     
  9. CSI PC

    CSI PC Veteran

Comes down to how well engines/APIs (more so DX12/Vulkan) can integrate with NVLink and its unified, cohesive cache in a multi-GPU setup, and that could look different from SLI or what we see with AFR.
TBH I need to see how UE4 has integrated NVLink.

    Edit:
While not exactly the same, it's worth noting that several of the UE4 demos have used NVLink when driven by multiple GPUs for their more impressive real-time demonstrations.
The real-time Star Wars ray tracing demo split work between four V100 GPUs (each doing different tasks) using NVLink, but that is just one example.
     
    Last edited: Aug 18, 2018
  10. Communism

    Communism Newcomer

4 frames of latency obviously doesn't matter when you are watching a video render (aka what that showcase is).

    In real games every single frame of latency matters.

Let's just agree that you don't entirely understand this particular scenario I was discussing with silent_guy, so we don't have to talk in circles, OK?
     
  11. CSI PC

    CSI PC Veteran

I know, which is why I said it was just one example, but the point is that it is a different presentation compared to SLI/AFR.
Does it have to be AFR / work like SLI?
The answer is no, but it comes down to how well the API/engines can utilise NVLink and split tasks without it being traditional AFR.

    Edit:
    Just to clarify I am not talking about supporting older games but a few current and future games.
     
    Last edited: Aug 18, 2018
  12. ShaidarHaran

    ShaidarHaran hardware monkey Veteran

    Radeon SSG was a very intriguing product when I first heard of it. Unfortunately, the price is too high and the performance isn't quite where I need it to be.

The particular use case I have in mind involves lots of triangles and many textures (up to 4K in size) spread out over a view distance of perhaps 100 miles or more. Much of the detail is lost to LOD, but there are still quality and performance gains to be had by keeping all the textures in VRAM and not needing to fetch from system RAM or disk. I don't know if it's the drivers, the architecture, or some of both, but AMD's products have not historically excelled in this workload. The workload is Lockheed Martin Prepar3D, for anyone interested. Now that flight simulators have moved to 64-bit, the potential to display the highest-quality textures "as far as the eye can see" basically fulfills my lifelong desire for visualization in this class of software.

    This generation may be too soon though, at least for the price I'm willing to pay. I could see maybe spending up to $2000 for a couple graphics cards if I knew they would get the job done and allow me to stop upgrading every cycle, but $6000 for a Quadro RTX 6000 is more than I'm willing to pay.
     
  13. silent_guy

    silent_guy Veteran Subscriber

Feel free to say why you’d need 50% for that. I think it’s not even close.
     
  14. BRiT

    BRiT (>• •)>⌐■-■ (⌐■-■) Moderator Legend Alpha

    Please keep the personal bickering in check.
     
  15. CSI PC

    CSI PC Veteran

Have any sites mentioned yet the improved interoperability between CUDA and the gaming APIs, specifically DX12 and Vulkan, with Turing?
It is part of the CUDA 10 platform (seems to be 'SM_80' onwards function compatibility), and something that should be promising.

    Edit:
Also of note for Turing is further-optimized mixed-precision GEMM performance in CUDA 10.
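Mixed-precision GEMM in the tensor-core sense (FP16 inputs, FP32 accumulation) can be sketched in NumPy; this is an illustration of the numeric scheme only, not CUDA code:

```python
import numpy as np

# Sketch of mixed-precision GEMM: FP16 inputs, FP32 accumulation.
# Tensor cores do this in hardware; here we emulate it for illustration.
rng = np.random.default_rng(0)
a = rng.standard_normal((64, 64)).astype(np.float16)
b = rng.standard_normal((64, 64)).astype(np.float16)

# Accumulate in FP32 to avoid the precision loss of a pure-FP16 product.
c = a.astype(np.float32) @ b.astype(np.float32)
print(c.dtype)  # float32
```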
     
    Last edited: Aug 19, 2018
  16. Geeforcer

    Geeforcer Harmlessly Evil Veteran

Does anyone know if any reviewers have received the cards, or is Monday just going to be an architecture overview?
     
  17. tEd

    tEd Casual Member Veteran

Probably the latter. Maybe some benches from Nvidia. I think reviewers receive the cards at the event.
     
  18. pharma

    pharma Veteran


    Rumor ...
    NVIDIA RTX 2070 Specs Leaked – 2304 Cores, 8GB GDDR6 at ~$400
    https://wccftech.com/nvidia-rtx-2070-specs-leaked-2304-cores-8gb-gddr6-at-400/
     
  19. Markus

    Markus Newcomer

Will real-time ray tracing have interesting non-graphics uses in games? 3D positional audio has not recovered since the Aureal A3D days. I know about AMD TrueAudio Next and I wish more games used it. Would something like that benefit from being able to cast vastly more rays, or is the bottleneck still the time-varying convolution kernels, etc.?
     
  20. OCASM

    OCASM Regular

    I don't know about that. OTOY has integrated Octane into Unity already and they've just officially announced an integration into UE4.

    Yes:

    https://www.techpowerup.com/246820/nvidia-does-a-trueaudio-rt-cores-also-compute-sound-ray-tracing
     