Nvidia Volta Speculation Thread

So they solve the core ray tracing performance problem by simply rendering fewer rays and guessing the gaps?

That’s certainly a novel way of attacking the problem and it might just work if you mix traditional rendering with ray tracing for lighting only.

Are there any demos out?
They kind of demoed it last year, separately from DXR-RTX; it involved a car (not the holodeck stuff), but some of the demos in the other thread suggest they are using this real-time solution with Volta as well.
https://forum.beyond3d.com/threads/directx-ray-tracing.60670/
 
Last edited:
So they solve the core ray tracing performance problem by simply rendering fewer rays and guessing the gaps?

That’s certainly a novel way of attacking the problem and it might just work if you mix traditional rendering with ray tracing for lighting only.

Just off-the-cuff musing on this is applying something analogous to reconstruction from nearby samples in prior frames, or using some of the meta-type properties of the surface or object to create more locally-accurate rules for interpolating values.
In a checkerboarded method, there's accumulated information stored in the form of render targets written earlier, while in a trained network there's an element of prior history being built into the weights in the network in addition to other data.

A network could find other correlations about a surface or specific properties that tend to behave similarly or are not perturbed within certain bounds.
Perhaps a number of local networks will pick up that a given area of rays tends to behave similarly, and that prior samples in time can combine with the learned behavior. Given the hardware's preference for SIMD layouts and array math, a good portion might like laying things out in a grid, and it might infer that a good approximation can be derived from their relationship with each other along some kind of plane.
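For the temporal-reuse part of that musing, here is a minimal numpy sketch, purely illustrative: the array layout, the nearest-neighbour reprojection, and the blend factor are my assumptions, not anything NVIDIA has described.

```python
# Rough sketch of reusing prior-frame samples via reprojection, then blending
# with the current frame's sparse/noisy result. Illustrative only.
import numpy as np

def temporal_accumulate(history, colour, motion, alpha=0.1):
    """history/colour: HxWx3 arrays, motion: HxWx2 screen-space motion vectors."""
    h, w, _ = colour.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Follow the motion vectors back to where each pixel was in the last frame.
    prev_x = np.clip(np.round(xs - motion[..., 0]).astype(int), 0, w - 1)
    prev_y = np.clip(np.round(ys - motion[..., 1]).astype(int), 0, h - 1)
    reprojected = history[prev_y, prev_x]
    # Exponential moving average: most of the signal comes from accumulated
    # history, so far fewer fresh rays per pixel are needed each frame.
    return alpha * colour + (1.0 - alpha) * reprojected
```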
 
So they solve the core ray tracing performance problem by simply rendering fewer rays and guessing the gaps?

That’s certainly a novel way of attacking the problem and it might just work if you mix traditional rendering with ray tracing for lighting only.

Are there any demos out?
This is probably the best example with further information on the approach, and I can also link the article that highlights the performance gain depending upon the GPU used and whether Tensor cores are involved:
http://research.nvidia.com/publicat...rlo-image-sequences-using-recurrent-denoising
https://www.aecmag.com/technology-m...deliver-8x-speed-boost-to-ray-trace-rendering

The 1st link describes the recurrent denoising autoencoder approach, while the 2nd link shows the interesting performance gain when comparing P100, V100, and V100+AI.
Still, the demos in the rendering thread are more applicable anyway IMO and more recent, so a more mature solution.

Edit:
I knew I had seen the demo/presentation somewhere; it relates to the AECmag ray tracing image:
http://on-demand.gputechconf.com/si...-michael-thamm-advanced-rendering-nvidia.html
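For anyone who wants the shape of the 1st link's idea in code, here is a heavily simplified recurrent denoising autoencoder sketch. Assuming PyTorch; the layer sizes, the hidden-state handling, and all the names are illustrative, not the paper's architecture.

```python
# Toy recurrent denoising autoencoder: noisy frames in, denoised frames out,
# with a hidden state carrying information between frames. Illustrative only.
import torch
import torch.nn as nn

class TinyRecurrentDenoiser(nn.Module):
    def __init__(self, in_ch=3, hidden_ch=16):
        super().__init__()
        # Encoder: noisy low-sample-count colour -> feature maps.
        self.encode = nn.Sequential(
            nn.Conv2d(in_ch, hidden_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden_ch, hidden_ch, 3, padding=1), nn.ReLU(),
        )
        # Recurrent mixing: fold the previous frame's hidden state into the
        # current features so temporal history accumulates in the network.
        self.mix = nn.Conv2d(hidden_ch * 2, hidden_ch, 3, padding=1)
        # Decoder: features -> denoised colour.
        self.decode = nn.Conv2d(hidden_ch, 3, 3, padding=1)

    def forward(self, noisy, hidden=None):
        feat = self.encode(noisy)
        if hidden is None:
            hidden = torch.zeros_like(feat)
        hidden = torch.relu(self.mix(torch.cat([feat, hidden], dim=1)))
        return self.decode(hidden), hidden

# Run over a short sequence: the hidden state carries frame-to-frame history.
model = TinyRecurrentDenoiser()
hidden = None
for frame in torch.rand(4, 1, 3, 64, 64):  # four fake noisy 64x64 frames
    denoised, hidden = model(frame, hidden)
```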
 
Last edited:
Just off-the-cuff musing on this is applying something analogous to reconstruction from nearby samples in prior frames, or using some of the meta-type properties of the surface or object to create more locally-accurate rules for interpolating values.
In a checkerboarded method, there's accumulated information stored in the form of render targets written earlier, while in a trained network there's an element of prior history being built into the weights in the network in addition to other data.

A network could find other correlations about a surface or specific properties that tend to behave similarly or are not perturbed within certain bounds.
Perhaps a number of local networks will pick up that a given area of rays tends to behave similarly, and that prior samples in time can combine with the learned behavior. Given the hardware's preference for SIMD layouts and array math, a good portion might like laying things out in a grid, and it might infer that a good approximation can be derived from their relationship with each other along some kind of plane.
That's great but it would still fail in complex scenes with lots of depth discontinuity and motion. Unless their AI can predict the future.
 
That's great but it would still fail in complex scenes with lots of depth discontinuity and motion. Unless their AI can predict the future.
There is a level of artifacting and conservative fall-back that is part of the trade-offs in checkerboard rendering methods. Some of those methods could be applied, just in a more localized manner.
It's more of a joke about how often a network being trained is going to find correlations based on intersections and reconstructed points stored in a 2D grid.

A lower density of rays also plays havoc with the granularity of the GPU hardware, which opens up the prospect of "spare" capacity in alignment, fetch, and ALU throughput that can be filled less expensively with historical information or additional analytical work.
 
That's great but it would still fail in complex scenes with lots of depth discontinuity and motion. Unless their AI can predict the future.
It wouldn’t surprise me if some AI networks are really good at this. Whether or not it can run in real time is a bit of a different story.
 
That's great but it would still fail in complex scenes with lots of depth discontinuity and motion. Unless their AI can predict the future.
That's not far off what async spacewarp was already doing with VR, where it would really smooth out animation using motion vectors. Stratifying the scene into various depths possibly addresses some occlusion issues and may even benefit the ray tracing. The different depths could likely do with being updated/traced at different frequencies as well. Trees and mountains off in the distance shouldn't need to be updated as frequently as objects in focus, especially in the case of VR, where the outer edges of the screen are blurred or rendered at lower resolution anyway.
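A toy sketch of that last idea, i.e. retracing depth layers at different cadences and reusing stale results in between. Everything here, including the layer split and the two callbacks, is hypothetical.

```python
# Hypothetical per-depth-layer update schedule: nearby geometry is retraced
# every frame, distant geometry only occasionally and reprojected otherwise.
LAYERS = [
    {"name": "near",    "period": 1},  # retrace every frame
    {"name": "mid",     "period": 2},  # retrace every 2nd frame
    {"name": "distant", "period": 8},  # trees/mountains: every 8th frame
]

def update_frame(frame_index, trace_layer, reproject_layer):
    """trace_layer / reproject_layer are hypothetical renderer callbacks."""
    for layer in LAYERS:
        if frame_index % layer["period"] == 0:
            trace_layer(layer["name"])      # spend fresh rays on this slice
        else:
            reproject_layer(layer["name"])  # reuse last result via motion vectors
```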
 
Well, it sounds like we can expect to see tensor cores in consumer level graphics hardware. I’m not sure where a big perf jump over Pascal is going to come from if the process remains 16/12nm though - the main thing to cut from the gargantuan V100 die seems to be FP64 throughput, and it’s hard for me to imagine that freeing up so much room for big gains in lower precision shader power (at affordable die sizes). I guess the tensor cores on the consumer parts will probably be targeted at inference (e.g. int8 or fp8 instead of fp16)?
 
Well, it sounds like we can expect to see tensor cores in consumer level graphics hardware. I’m not sure where a big perf jump over Pascal is going to come from if the process remains 16/12nm though - the main thing to cut from the gargantuan V100 die seems to be FP64 throughput, and it’s hard for me to imagine that freeing up so much room for big gains in lower precision shader power (at affordable die sizes). I guess the tensor cores on the consumer parts will probably be targeted at inference (e.g. int8 or fp8 instead of fp16)?
Yeah.
A little while back I was surmising that they would have to implement Tensor cores on the GVx02 and GVx04 (however the Tesla models end up being named), as that is a tier of cards Nvidia also pushes heavily for inferencing, replacing either the P40 or P4 a company may have installed.
They would have different performance/efficiency characteristics, and so offer options and price points for buyers, as the Tensor cores are limited in number per GPC.
I am not sure how much sense it would make to totally redo the GPU for GeForce without the Tensor cores rather than just disabling them for now, especially when one also considers the potential on Quadro.
 
HOCP did a fresh Titan V review; the card is on average 30% faster than the 1080 Ti now @4K, and in some games, like Kingdom Come, it's 44% faster. In The Division, Doom, Deus Ex, and Sniper Elite 4 it's 33% faster. Overclocking easily adds 10~15% more performance. There are still some kinks though: some games flat out refuse to work @4K, and some still crash. And performance @1440p suffers from CPU limitations.

https://www.hardocp.com/article/2018/03/20/nvidia_titan_v_video_card_gaming_review/5
 
Well, it sounds like we can expect to see tensor cores in consumer level graphics hardware. I’m not sure where a big perf jump over Pascal is going to come from if the process remains 16/12nm though - the main thing to cut from the gargantuan V100 die seems to be FP64 throughput, and it’s hard for me to imagine that freeing up so much room for big gains in lower precision shader power (at affordable die sizes). I guess the tensor cores on the consumer parts will probably be targeted at inference (e.g. int8 or fp8 instead of fp16)?

They could decide to increase the die size from ~470mm² closer to the ~600mm² seen with P100 for their 'V'102 cards, while the 'V'104 moves up to ~470mm².
Also, looking back at P100 relative to GP102, the FP64 cores increased the die by around 29%, and in a more compute-orientated application such as Amber the 1080 Ti is around 9-14% faster (the Titan GP102 is a bit faster again); very crude and simplistic, I know.
However, a simplistic reduction of V100 is still not enough even at 610mm², and how expensive would it make the 'V'102 to have a mature-process die close to 600mm², although yields should be reasonable now and there is flexibility to disable at least 2 SMs.
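As a quick sanity check on the ~29% figure, using the publicly quoted die sizes (GP100 ~610mm², GP102 ~471mm², GV100 ~815mm²):

```python
# Back-of-the-envelope die-area arithmetic; sizes are the publicly quoted ones.
gp100, gp102, gv100 = 610.0, 471.0, 815.0
print(f"GP100 vs GP102: {100 * (gp100 / gp102 - 1):.1f}% larger")  # ~29.5%
print(f"GV100 vs GP102: {100 * (gv100 / gp102 - 1):.1f}% larger")  # ~73%
```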
 
IBM has come up with a nice machine learning library specifically optimised for the Power9-NVLINK2-Volta V100 nodes:
https://arxiv.org/pdf/1803.06333.pdf

It needs more validation, but they are suggesting it is much faster than TensorFlow on their system, ridiculously faster, while also making very good use of NVLINK2 relative to PCIe.
The NVLINK2 performance is in section 5.4 (Profiling) of the paper.
Anyway, it explains nicely their approach to parallelism and also the efficient use of sparse data structures.

It is not going to replace TensorFlow, but some of the HPC implementations will probably be looking at it at some point.
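On the sparse-data point, a generic illustration (nothing to do with the paper's actual implementation) of why CSR-style structures matter for these workloads:

```python
# A dataset that is 99% zeros shrinks enormously in CSR form, and the
# matrix-vector products at the heart of linear-model training then only
# touch real data. Generic scipy example, not IBM's code.
import numpy as np
from scipy.sparse import random as sparse_random

X = sparse_random(10000, 1000, density=0.01, format="csr")  # 1% non-zeros
w = np.random.rand(1000)

dense_bytes = X.shape[0] * X.shape[1] * 8                   # float64 dense
csr_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
print(f"dense: {dense_bytes / 1e6:.0f} MB, CSR: {csr_bytes / 1e6:.1f} MB")

scores = X @ w  # sparse mat-vec: the core op in e.g. logistic regression
```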
 
Apparently there is a hardware bug affecting the Titan V in certain scientific applications: engineers tried running identical simulations repeatedly on the Titan V, but each run returned slightly different results. This is not a problem on the Titan Xp, although the author claims that older hardware sometimes had issues like this that got resolved through patches.

Some people theorized it's a memory issue. Also, apparently it doesn't affect gaming, though there are still a lot of gaming instabilities with the Titan V.

https://www.theregister.co.uk/2018/03/21/nvidia_titan_v_reproducibility/
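For context, the kind of check the article describes boils down to something like this. Hypothetical sketch only: CuPy and a matrix product are stand-ins for the actual simulation code the engineers ran.

```python
# Run the same GPU computation repeatedly with identical inputs and compare
# the outputs bit-for-bit. Stand-in workload, not the engineers' code.
import cupy as cp

def run_once(seed=0):
    cp.random.seed(seed)                               # identical inputs each run
    a = cp.random.rand(4096, 4096).astype(cp.float32)
    return cp.asnumpy(a @ a)

reference = run_once()
for i in range(10):
    if not (run_once() == reference).all():
        print(f"run {i}: output differs from the reference")
```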
 
They are fortunate this happened before GTC :)
Regarding the memory though, it is not pushed that hard by Nvidia, possibly due to HBM2's thermal characteristics when it is.
Maybe it also comes down to having the Tensor Cores enabled when not using them while doing traditional FP32 compute; the CUDA documentation linked earlier implies it uses the mixed-precision mode when doing so in certain functions.
Possibly a behaviour they need to modify, unless deliberately selected, if it is seen to be contributing to this problem.
 
Last edited:
Shame none of them contacted Amber to see if their results are accurate with the FP32 solvent benchmarks for V100.
I should have said, regarding memory, that I was only commenting in the context of the article talking about when memory is pushed, not whether it is something else, say related to the ECC HBM2 function.

But then why was it not identified with P100 if the issue is memory? That was probably running as close to the limits as V100 does with the more mature HBM2 production.
Maybe it is some kind of memory/cache-related issue due to changes such as Unified Memory amongst others, but it could, as already mentioned, come back to feature behaviour when the Tensor Cores are not disabled.
 
Last edited:
It wasn't made clear how significant the errors were, just how often they happened for some cards.
The magnitude of an error could be massive if it's random flips from bit 0 to 31 or 63.
If it's more constrained, perhaps a timing issue in specific operations' blocks or conversion logic might use the wrong precision or have errors in a specific subset of bits.

Perhaps downclocking and overvolting the silicon or memory can help narrow down instabilities in one place or the other.
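A rough illustration of the magnitude point, flipping single bits of an IEEE-754 double (my own example, not from the article):

```python
# Flipping a low mantissa bit barely changes a double; flipping a high
# exponent bit changes it catastrophically.
import struct

def flip_bit(x: float, bit: int) -> float:
    (as_int,) = struct.unpack("<Q", struct.pack("<d", x))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped

print(flip_bit(1.0, 0))   # bit 0 (mantissa): 1.0000000000000002
print(flip_bit(1.0, 62))  # bit 62 (exponent): inf
```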
 
It wasn't made clear how significant the errors were, just how often they happened for some cards.
The magnitude of an error could be massive if it's random flips from bit 0 to 31 or 63.
If it's more constrained, perhaps a timing issue in specific operations' blocks or conversion logic might use the wrong precision or have errors in a specific subset of bits.

Perhaps downclocking and overvolting the silicon or memory can help narrow down instabilities in one place or the other.
They made it sound like it was very reproducible on all the cards.
The HBM2 memory is clocked lower than AMD's and well within spec, from what we can tell.
Like I said, the P100 was running closer to the edge with HBM2 (with what was available) but does not have these problems, it seems.
 
They made it sound like it was very reproducible on all the cards.
The HBM2 memory is clocked lower than AMD's and well within spec, from what we can tell.
Like I said, the P100 was running closer to the edge with HBM2 but does not have these problems, it seems.
They specifically said it happened on 2 out of 4 cards, not that it was reproducible on all the cards. The only spots mentioning anything about reproducibility were the ones discussing how important reproducibility is for scientific calculations.
 
They specifically said it happened on 2 out of 4 cards, not that it was reproducible on all the cards. The only spots mentioning anything about reproducibility were the ones discussing how important reproducibility is for scientific calculations.
You're right, 2 of the 4.
However it is reproducible, as it happens 10% of the time on half of their cards (a small sample size for statistics, I agree).
After repeated tests on four of the top-of-the-line GPUs, he found two gave numerical errors about 10 per cent of the time. These tests should produce the same output values each time again and again. On previous generations of Nvidia hardware, that generally was the case. On the Titan V, not so, we're told
......
It is not down to random defects in the chipsets nor a bad batch of products, since Nvidia has encountered this type of cockup in the past, we are told.
But we have no more information than that, compounded by that last sentence I quote; the P100 was running closer to the limits of memory for when it was available and was not downclocked.
 
Last edited: