NVidia Ada Speculation, Rumours and Discussion

Jawed · Aug 6, 2021

If NVidia does the COPA thing:

NVIDIA COPA - Composable On-Package Architecture | Beyond3D Forum

then there could be a dense-AI and a dense-HPC version of Hopper.

Presumably Lovelace will be a third variant with balanced compute and AI.

troyan · Aug 6, 2021

CarstenS said:
So it's your assumption then.

No, its reality. There are physical limitions in chip design. One is data movement for calculations. Going from 22,4 FP32 TFLOPs to 44,5 FP64 TFLOPs is impossible. Using Matrix engines to deliver 44,5 FP64 TFLOPs TensorOps on the other hand is possible.

CarstenS · Aug 6, 2021

troyan said:
No, its reality. There are physical limitions in chip design. One is data movement for calculations. Going from 22,4 FP32 TFLOPs to 44,5 FP64 TFLOPs is impossible. Using Matrix engines to deliver 44,5 FP64 TFLOPs TensorOps on the other hand is possible.

Maybe that was Intels hybris too, not thinking of how you can sidestep those realities with chiplets. Especially if cost is not a primary factor. Each chip of MI200 only needs to get to half of the cited FLOP counts.

Nvidia seems to have realized this, judging from the paper jawed dug up. At least they are exploring this area. It just remains a question of cost and necessity, if they will/can employ this for Lovelace already.

JoeJ · Aug 6, 2021

Jawed said:
My belief is that the step change in per-pixel quality and consistency that ray tracing based techniques can bring will split devs into two camps: those that will continue as if nothing happened and those who are willing to go back to the drawing board. I expect that traditional fixed-function hardware will underlie this new perspective but it will be plugged-in to support ray-traced rendering. There'll be a lot of navel-gazing focused on what that hardware can really do when used properly.

I think we're exiting the "ray tracing is tacked-on" mode of graphics development. I'm hopeful that in a couple of years we'll play the fruits of this.

Ha ok, i thought your assumption was that RT would motivate devs to 'get optimization or performance right', which would not make so much sense.
Regarding camps i see those main topics:
* Question of cost / benefit ratio, which surely gets to much attention and heat.
* Learning the basics: Trying to model lighting correctly is not new to us, but things like importance sampling maybe are, and understanding related math is harder than the things we did before in realtime gfx. This makes RT less accessible to hobby / indie development for example. Some will not invest the time to learn it in all details for this reason.
* New optimization ideas related to games: Here we'll likely just adopt from the decades of work already done on the subject. The only really new field is aggressive denoising (which enables to use RT in realtime at all). The other option to help performance, optimized sampling strategies, is well researched from the offline guys. Still ongoing research ofc., but that's not really a topic related to HW or specifically about differences of realtime vs. offline.
* Improvement on APIs: A big one for me personally, since i can not use RT at all although i want to.

What i don't see is potential progress from 'using the HW properly' and experience, because there is not so much to research or try out, thanks to fixed function HW handling the costly parts.
It's different on consoles because AMDs RT leaves traversal to the shader cores, so it's programmable. But it seems NVs way is just better overall, and likely those options will disappear in the future.
Related discussion of traversal shaders on all future HW likely makes it come back at some time, but for efficiency we want something like a callback only on certain BVH nodes. We don't wont to handle the full traversal on our own.
And as this raises the question on which BVH nodes should do the callback, we likely want to open up BVH as well at the same time or before that.

So that's all future and potential stuff. At the moment our problem is not how to use the HW properly - it's just a TraceRay function. The function is ofc. expensive, but we can't do much to reduce it's cost, other than bundling similar rays to minimize divergence, which is nothing new.
But some people do have a problem with the important data structures (BVH) being not accessible. This could be solve on software side in form of API changes and providing vendor specifications of their data structures. It's about problems related to LOD, streaming BVH instead calculating it, or issues like animated foliage. So that's wher i expect the most progress coming from, if they get it right, or done at all.

troyan · Aug 6, 2021

CarstenS said:
Maybe that was Intels hybris too, not thinking of how you can sidestep those realities with chiplets. Especially if cost is not a primary factor. Each chip of MI200 only needs to get to half of the cited FLOP counts.

Nvidia seems to have realized this, judging from the paper jawed dug up. At least they are exploring this area. It just remains a question of cost and necessity, if they will/can employ this for Lovelace already.

nVidia has released a MCM paper years ago: https://research.nvidia.com/sites/default/files/publications/ISCA_2017_MCMGPU.pdf
COPA is not a MCM design.

A100 delivers up to 19,5 TFLOPs FP64 within 250W. P100 PCIe was able of 4,7 TFLOPs FP64 with 250W. So this is a 4x increase in performance going from 16nm -> 7nm and using TensorCores.
Now i should believe that AMD is doing the same within one year on the same process?

OlegSH · Aug 6, 2021

CarstenS said:
So it's your assumption then.

Is there a source for FP64 being done on vector units?
Bacause if there is no any confirmation of this, I'd rather follow the most technically plausible explanation, which is matrix multiply units since they require less BW.

DegustatoR · Aug 6, 2021

troyan said:
COPA is not a MCM design.

You could argue that it in fact is since it moves the MCs into separate dies. It's not an MCM in a sense of putting >1 compute die in a product but this is a minor thing really as the main idea of COPA is to build products from specialized dies according to market needs, and if the market will need a product with >1 compute die then it fits the idea well enough.

CarstenS · Aug 6, 2021

troyan said:
nVidia has released a MCM paper years ago: https://research.nvidia.com/sites/default/files/publications/ISCA_2017_MCMGPU.pdf
COPA is not a MCM design.

Even earlier - good for them! Regarding MCM: Forgive me, I was just going by the abstract of the Paper, which says "A COPA-GPU leverages multi-chip-module disaggregation to support maximal design reuse, along with memory system specialization per application domain." So, maybe depending on the definition, COPA can be an MCM?

troyan said:
A100 delivers up to 19,5 TFLOPs FP64 within 250W. P100 PCIe was able of 4,7 TFLOPs FP64 with 250W. So this is a 4x increase in performance going from 16nm -> 7nm and using TensorCores.
Now i should believe that AMD is doing the same within one year on the same process?

You are free to believe whatever you want of course. I was just interested if there's anything else to back up your claim, which you presented as if it were a fact.

We're at 11.5 TFLOPS FP64 with MI100 right now. With half rate FP64, Video-Engines and a die size of 750 mm² according to the TPU-Database. Just going full rate FP64 (and doing everything else as packed math), while loosing video-hardware would get us almost to the required TFLOPs rating. Then we have different flavors of TSMCs 7nm process, where N7+ for example supposedly yields a 20% density improvement, depending where you're coming from and that MI200 is not on 5 nm already. edit: We do not know the power consumption of MI200 yet. Maybe they can scale from 300 to 400 watts with ease, yielding another 15-20% clock speed increase? HPEs Cray XE is direct watercooling after all.
So FWIW, I do not think it unlikely that you're right, but neither do I think it's impossible that you're wrong.

OlegSH said:
Is there a source for FP64 being done on vector units?
Bacause if there is no any confirmation of this, I'd rather follow the most technically plausible explanation, which are matrix multiple units since they require less BW.

No. But then, I did not make this claim. In due discourse, you make a claim you prove it. Or you express that it is your belief, educated guess or whatever.
In case it wasn't clear, "so 5x A100 w/o Tensor cores." was of course referring to A100 w/o Tensor core, because your base would be 9.7, not 19.5.,

Kaotik · Aug 6, 2021

troyan said:
4x more FP64 over the compute units with a vector unit wont happen. You have to scale everything with it and alone data movement off- and on-chip will kill effciency.

That 4x is for double the CUs. Each CU is going from 1:2 to 1:1 FP64 and there's two chiplets with 128 CUs each, resulting double the CUs compared to M100 (though M100 has 8 CUs disabled, M200 might have some too, so talking in terms of full chips)

troyan · Aug 6, 2021

Kaotik said:
That 4x is for double the CUs. Each CU is going from 1:2 to 1:1 FP64 and there's two chiplets with 128 CUs each, resulting double the CUs compared to M100 (though M100 has 8 CUs disabled, M200 might have some too, so talking in terms of full chips)

That doesnt make any sense. They would have double everything and have to increase off chip bandwidth by 4x.

/edit: From GA100:

The new Double
Precision Matrix Multiply Add instruction on A100 replaces 8 DFMA instructions on V100,
reducing instruction fetches, scheduling overhead, register reads, datapath power, and shared
memory read bandwidth. Using Tensor Cores, each SM in A100 computes a total of 64 FP64
FMA operations/clock (or 128 FP64 operations/clock), which is twice the throughput of Tesla
V100.

https://images.nvidia.com/aem-dam/e...ter/nvidia-ampere-architecture-whitepaper.pdf

JoeJ · Aug 6, 2021

OlegSH said:
I don't know what you find inefficient about RT.

Ofc. i mostly mean the cost on building BVH, which should be zero actually. You should know ;D
But beside that, it's not that we can do wonders with 1spp, and cranking up to 10spp still is no wonders.

OlegSH said:
Honestly, the problem is that your average console developer treats PC as a third tier platform and unless there is an IHV involved, which would help with the workloads you've mentioned, the developer won't do anything for PC.

Exactly. Thus my question if we really need monster GPUs.

OlegSH said:
I don't think that's the way to go. Those SS hacks, fancy SM based area shadow techniques, compute GI, etc are just another hacks with tons of drawbacks that are layered on top of the energy inefficient computations on general multiprocessors in a time when the Denard scaling is long dead, hence the 500W monsters on the horizon.

I tend to agree, but we could be wrong about some things. I know for example you are very wrong assuming compute GI has to be inaccurate or inefficient in comparison to RT GI. From what i see, it's the exact opposite of that

But yes, i do not like SM area shadow hacks or layered framebuffer stuff. I only consider it might do better in the end than all RT eventually, but i don't wan't to go there.
To me fluid sim and volumetric lighting remains most attractive to utilize monster GPU. It's expensive, but i don't see any magic to optimize it's brute force nature, and it would add to atmosphere.

OlegSH said:
Why bother making more of those fragile and unmaintainable systems

You know why: Because not any gaming system has capable RT power. What does it help if your artists no longer need to tweak lighting for the Enhanced version, when they still have to do it to support handheld stuff, or PCs if GPUs become so high price a majority can't update, but still buys games?
What we want is capable and affordable entry level. Without that, it will take even longer until we can put RT on min. specs.
We surely don't disagree on that. But i may be too pessimistic regarding 'high end craze'. We'll see what happens...

trinibwoy · Aug 6, 2021

JoeJ said:
Trying to model lighting correctly is not new to us, but things like importance sampling maybe are, and understanding related math is harder than the things we did before in realtime gfx. This makes RT less accessible to hobby / indie development for example. Some will not invest the time to learn it in all details for this reason.

Sorry but this is objectively false. The current state of the art hacks in real-time graphics are incredibly complex. Not only is the math complex but the concepts themselves are often abstract. The only reason these hacks are accessible to indie devs is that they come for free in game engines that now benefit from many years of R&D on these techniques. It’s not because the math is easy.

On the other hand RT conceptually is incredibly simple to understand. Optimizations like importance sampling increase complexity of course but no more so than the multitude of hacks we use today like VSMs, SDFs or SSAO.

Jawed · Aug 6, 2021

JoeJ said:
Ha ok, i thought your assumption was that RT would motivate devs to 'get optimization or performance right', which would not make so much sense.

I think the keenest subset of devs will do precisely that. There's not many of them. AAA graphics continues to be an exclusive preserve.

Regarding camps i see those main topics:
* Question of cost / benefit ratio, which surely gets to much attention and heat.
* Learning the basics: Trying to model lighting correctly is not new to us, but things like importance sampling maybe are, and understanding related math is harder than the things we did before in realtime gfx. This makes RT less accessible to hobby / indie development for example. Some will not invest the time to learn it in all details for this reason.

They're basically irrelevant.

* New optimization ideas related to games: Here we'll likely just adopt from the decades of work already done on the subject. The only really new field is aggressive denoising (which enables to use RT in realtime at all). The other option to help performance, optimized sampling strategies, is well researched from the offline guys. Still ongoing research ofc., but that's not really a topic related to HW or specifically about differences of realtime vs. offline.

It appears the rate of change/improvement in real time ray tracing techniques has gone exponential and that history is merely a baseline.

* Improvement on APIs: A big one for me personally, since i can not use RT at all although i want to.

I'm thinking you're just cutting off your nose to spite your face.

What i don't see is potential progress from 'using the HW properly' and experience, because there is not so much to research or try out, thanks to fixed function HW handling the costly parts.
It's different on consoles because AMDs RT leaves traversal to the shader cores, so it's programmable. But it seems NVs way is just better overall, and likely those options will disappear in the future.
Related discussion of traversal shaders on all future HW likely makes it come back at some time, but for efficiency we want something like a callback only on certain BVH nodes. We don't wont to handle the full traversal on our own.
And as this raises the question on which BVH nodes should do the callback, we likely want to open up BVH as well at the same time or before that.

I expect console devs have full transparency of the BVHs and their problem will be to keep ancillary data structures aligned with the BVHs.

So that's all future and potential stuff. At the moment our problem is not how to use the HW properly - it's just a TraceRay function. The function is ofc. expensive, but we can't do much to reduce it's cost, other than bundling similar rays to minimize divergence, which is nothing new.

My opinion is that the TraceRay function backed by real time conventional hardware will be transformative. Whether it's rasterisation, depth, atomics or mesh shaders, the sum will be far greater than the parts - once devs think about accelerating rays instead of "adding quality to rasterisation with some rays".

JoeJ · Aug 6, 2021

trinibwoy said:
Sorry but this is objectively false. The current state of the art hacks in real-time graphics are incredibly complex. Not only is the math complex but the concepts themselves are often abstract. The only reason these hacks are accessible to indie devs is that they come for free in game engines that now benefit from many years of R&D on these techniques. It’s not because the math is easy.

On the other hand RT conceptually is incredibly simple to understand. Optimizations like importance sampling increase complexity of course but no more so than the multitude of hacks we use today like VSMs, SDFs or SSAO.

Ok, i see it's a matter of opinion and perspective. I was referring to old days of traditional raterization, but i see this no longer really holds.
What i meant was understanding a basic rasterizer like Quake is simpler than say a path tracer using monte carlo and IS. The former has more complexity in code, but i found the math behind the latter more advanced.
Off the shelf engines likely help with RT adoption even more than with modern rasterization practices.

Kaotik · Aug 6, 2021

troyan said:
That doesnt make any sense. They would have double everything and have to increase off chip bandwidth by 4x.

/edit: From GA100:

https://images.nvidia.com/aem-dam/e...ter/nvidia-ampere-architecture-whitepaper.pdf

They have 4x 16 GB HBM2e per chiplet for a total of 128 GB HBM2 behind 8192-bit membus for that.

DegustatoR · Aug 6, 2021

JoeJ said:
Off the shelf engines likely help with RT adoption even more than with modern rasterization practices.

I dunno about this one. All such engines provide rasterization fall backs for all RT use cases they have. This limits them in how they can use RT as any form of RT usage which can't be handled with rasterization is out of their scope for now. An engine which would target RT h/w as a minimum requirement will likely be a lot more interesting in its RT usage then these 3rd party engines.

JoeJ · Aug 6, 2021

DegustatoR said:
I dunno about this one. All such engines provide rasterization fall backs for all RT use cases they have. This limits them in how they can use RT as any form of RT usage which can't be handled with rasterization is out of their scope for now. An engine which would target RT h/w as a minimum requirement will likely be a lot more interesting in its RT usage then these 3rd party engines.

Sure. We don't have enough RT GPUs out there to make any RT only games yet?
Currently, giving RT support at all to many games is a huge step forward, which would not be possible without U engines.

DegustatoR · Aug 6, 2021

JoeJ said:
Sure. We don't have enough RT GPUs out there to make any RT only games yet?
Currently, giving RT support at all to many games is a huge step forward, which would not be possible without U engines.

Has more to do with the need to support old consoles like XBO.

RT in engines like UE4 increase the number of titles which use RT h/w but I'd say that on average does more damage to RT image than benefit. Most games which make use of RT in such titles do it like a checkbox feature which eats a lot of performance while not providing that much improvement over the baseline raster option (like SSR, etc.) A game which will make extensive use of RT will likely implement its capabilities into gameplay making the raster fallback either impossible or so slow and buggy that it simply wouldn't work.

UE5 is an interesting beast from this POV as it seem to sit in-between previous gen h/w and next gen RT GPUs. It will be interesting to see how Lumen will evolve over the next several years.

JoeJ · Aug 6, 2021

DegustatoR said:
RT in engines like UE4 increase the number of titles which use RT h/w but I'd say that on average does more damage to RT image than benefit. Most games which make use of RT in such titles do it like a checkbox feature which eats a lot of performance while not providing that much improvement over the baseline raster option (like SSR, etc.) A game which will make extensive use of RT will likely implement its capabilities into gameplay making the raster fallback either impossible or so slow and buggy that it simply wouldn't work.

Now that you say that - i thought the same early on. I was not really impressed from UE 'bolt on' effects and wanted more. Then i saw results are pretty good, and support of effects is quite complete, but i still assumed we could get at least better perf. from reworking engines with RT in mind.
And meanwhile we got this too. To me Exodus is the perfect example of RT done right. Though, if i compere side by side with initial non RT version, i realize my own expectations on what RT would enable were too high. I fell into this mistake of having unrealistic expectations although i was very a critical about HW RT from the beginning. Now i think Exodus still is limited by artwork not made for RT. Some more wow should be possible. But i realize: As a gfx developer, i can't do wonders anymore. I can fix all those flaws and issues, but i can't improve the things that already worked most of the time oh so much. Even if we improve it to the point where we have Marbles demo gfx in games, my jaw would not drop. To get impressive results, which is still possible, it really seems artwork and content is the most powerful tool now. The dream about photorealism is no longer that exciting.

That's no bad thing. But it changes a lot for me. E.g. the conclusion of less powerful HW being good enough, suiting the current situation of crisis better while still providing good visuals.
But also: With lower expectations, i don't agree UE4 RT does any harm. Any RT support is nice to have for those having a RT GPU.
For example, i love the game Amid Evil which is a UE4 indie retro shooter. And it got RT patch and a new world designed for RT. Reviews were meh - just reflections, high pref cost. But i'll be happy for sure, and will save up the new world until i have RT GPU, although i want to play it already now.
I'll also take a look at Exodus and CP again, but those games are not really to my liking, so the technically inferior UE4 game will be the most exciting RT experience for me, because i like this game.

That's an important point, no? Maybe the most important. We are just too geeky all the day

Bondrewd · Aug 6, 2021

troyan said:
That doesnt make any sense. They would have double everything and have to increase off chip bandwidth by 4x.

Oh boy.
They did!

NVidia Ada Speculation, Rumours and Discussion

Jawed

troyan

CarstenS

Moderator

JoeJ

troyan

OlegSH

DegustatoR

CarstenS

Moderator

Kaotik

Drunk Member

troyan

JoeJ

trinibwoy

Meh

Jawed

JoeJ

Kaotik

Drunk Member

DegustatoR

JoeJ

DegustatoR

JoeJ

Bondrewd

Similar threads