So the article with the updated latency measurements doesn't seem to change the picture much in terms of global memory latency on Navi 21 and GA102: they're both about the same.
I think these articles are biased towards identifying best-case latency figures, which is kind of the opposite of the "worst case" I was originally thinking about. So I'm not convinced they're that much use.
That assumes AMD's hit-rate stats are relevant for Nvidia's memory pipeline. There are other factors that influence hit rates besides size: pinning, compression, pre-fetching, eviction policies, etc. Nvidia also has caches local to the RT cores that help reduce the load on L2.
It makes more sense to compare AD102’s 96MB to the 6MB of GA102.
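For scale, here's a back-of-envelope using the old "miss rate scales with 1/sqrt(capacity)" rule of thumb. The 60% baseline miss rate at 6MB is invented purely for illustration, not AMD or Nvidia data, but it shows why a 6MB to 96MB jump can matter even if AMD's published hit-rate curves don't transfer to Nvidia's memory pipeline:

```python
def est_miss_rate(cache_mb, baseline_mb=6.0, baseline_miss=0.60, exponent=0.5):
    """Scale an assumed baseline miss rate by (size ratio)^-exponent (rule of thumb)."""
    return baseline_miss * (baseline_mb / cache_mb) ** exponent

for mb in (6, 32, 96, 128):
    print(f"{mb:>4} MB -> estimated miss rate {est_miss_rate(mb):.2f}")
```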
There's no doubt that more cache will be better. I'm merely questioning whether it's enough.
The fact that estimates put the die size at around 611mm² implies to me that Nvidia's performance models show no need for more cache. Honestly, I'm still surprised. Cache is effectively free die space, since it offers fine-grained redundancy and so shouldn't affect yields.
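On the "fine-grained redundancy" point, a toy Poisson yield model makes it concrete. Every number below is invented for illustration; the point is just that only the unrepairable fraction of the SRAM counts as killer area:

```python
import math

defect_density = 0.1     # defects per cm^2 -- invented, just for illustration
logic_area     = 4.0     # cm^2 of non-repairable logic -- invented
sram_area      = 2.0     # cm^2 of extra cache -- invented
repair_rate    = 0.98    # fraction of SRAM defects fixed by spare rows/columns -- assumed

def poisson_yield(area_cm2, d0):
    """Probability of zero killer defects landing in the area."""
    return math.exp(-d0 * area_cm2)

# Only the unrepairable slice of the SRAM contributes killer area.
yield_logic_only = poisson_yield(logic_area, defect_density)
yield_with_cache = poisson_yield(logic_area + sram_area * (1 - repair_rate), defect_density)

print(f"logic only:        {yield_logic_only:.3f}")
print(f"logic + big cache: {yield_with_cache:.3f}")  # barely changes
```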
Nvidia's semi-tiled rasterisation approach should also do wonders for the L2$ hit rate.
That's double-counting though, as it's something that Ampere already does.
The second article I linked regarding latency starts off with an analysis of Unigine Superposition "8K Optimized" and concludes that latency is Ampere's problem:
"Loosely translated, Nvidia is saying that if the shaders are using less than 80% of their issue bandwidth while they have work to do, and the top stall reason is because warps were waiting on cache/memory, then you’re limited by cache/memory latency."
Is that purely texturing related?
A single SIMD for FP32/Int32 certainly solves the primary utilisation problems that Ampere's dual-SIMD layout introduces. It would also save a fair amount of die space.
The extra register files and scheduling/scoreboarding obviously eat into the space saving, but it should still come out far ahead in terms of utilisation.
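To put a rough number on one of those utilisation problems, here's a toy instruction-mix model of a GA10x sub-core (16 FP32-only lanes plus 16 FP32/INT32 lanes) versus a hypothetical unified-SIMD sub-core with the same total lane count. This is my own simplification and only captures the mix side of the problem; scheduling and register-file pressure are the other half:

```python
def ampere_subcore(int_frac):
    """16 FP32-only lanes + 16 FP32/INT32 lanes: INT32 can only use the shared pipe,
    so integer-heavy mixes leave the FP32-only pipe partly idle."""
    total = 32.0 if int_frac == 0 else min(32.0, 16.0 / int_frac)
    return ((1 - int_frac) * total, int_frac * total)   # (fp32 ops/clk, int32 ops/clk)

def unified_subcore(int_frac, lanes=32):
    """Hypothetical: every lane handles FP32 or INT32, so no capacity gets stranded."""
    return ((1 - int_frac) * lanes, int_frac * lanes)

for f in (0.0, 0.3, 0.5, 0.8):
    print(f, ampere_subcore(f), unified_subcore(f))
```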
I like this speculation, which is supposedly founded upon stuff that can't be revealed.
The picture doesn’t really make sense. The ratio of control to compute seems unbalanced, and each sub-core having just a single 16-wide SIMD doesn’t increase peak SM throughput.
Within the SM there are twice as many of these partitions.
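Which is why the peak works out the same while the extra schedulers could help utilisation. A back-of-envelope issue model, using the lane counts from the speculation (not confirmed figures), where p is the chance that a sub-core's scheduler has an eligible instruction in a given clock:

```python
def ampere_sm_fp32(p, subcores=4, lanes_per_subcore=32):
    # One issue per clock covers a 32-thread instruction, so 32 lanes need an
    # eligible instruction every single clock to stay saturated.
    return subcores * min(lanes_per_subcore, 32 * p)

def speculated_sm_fp32(p, subcores=8, lanes_per_subcore=16):
    # Same 128 lanes per SM, but each 16-wide sub-core saturates at one issue
    # every other clock, so it tolerates p down to 0.5.
    return subcores * min(lanes_per_subcore, 32 * p)

for p in (1.0, 0.75, 0.5):
    print(p, ampere_sm_fp32(p), speculated_sm_fp32(p))
# 1.0 -> 128 vs 128 (no peak gain), 0.75 -> 96 vs 128, 0.5 -> 64 vs 128
```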
It's worth remembering that the register file in Ampere doesn't have the bandwidth to support, for example, FMA on both FP32 pipes simultaneously. So even if the instruction mix in the kernel would allow two independent FMAs to be issued "simultaneously", the register file will say no. The partition then depends on result forwarding, held in the operand collector or somewhere else, to avoid the register file bandwidth crunch.
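A rough operand-count check shows the crunch. The read-port figure here is an assumption for illustration, not a documented Ampere number; only the three-source-operands-per-FMA part is fixed:

```python
def rf_limited(fmas_per_window, reads_per_fma=3, rf_reads_per_window=4, forwarded=0):
    """True if the register file can't feed the FMAs in an issue window without
    forwarding/operand reuse. rf_reads_per_window=4 is an assumed figure."""
    return fmas_per_window * reads_per_fma - forwarded > rf_reads_per_window

print(rf_limited(fmas_per_window=1))               # False: a single FMA fits
print(rf_limited(fmas_per_window=2))               # True:  6 reads > 4 available
print(rf_limited(fmas_per_window=2, forwarded=2))  # False: reuse/forwarding covers it
```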