NVidia Ada Speculation, Rumours and Discussion

Status
Not open for further replies.
Most of Hopper's extra power budget goes into FP64 and its 18 NVLink 4 links at 900 GB/s, which Ada won't have.
Otherwise, if AD102 has the same SM as Ampere, 18,432 CUDA cores at 2.5 GHz gives ~92 TFLOPS
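Back-of-envelope, as a sketch: using the widely rumoured figure of 18,432 FP32 lanes and counting an FMA as 2 FLOPs per lane per clock (none of this is confirmed):

```python
# Rumoured full AD102 configuration -- treat every figure as speculation.
lanes = 18432        # rumoured FP32 lanes (CUDA cores)
clock_ghz = 2.5      # rumoured boost clock
flops_per_lane = 2   # one fused multiply-add counts as 2 FLOPs

tflops = lanes * flops_per_lane * clock_ghz * 1e9 / 1e12
print(f"{tflops:.2f} TFLOPS")  # 92.16 TFLOPS
```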

AD102 has RT cores, more cache and all the triangle-crunching machinery. Maybe it all cancels out, but I would be impressed if it's under 700 mm².
 
I'll be surprised if AD102 is bigger than GA102, which is 628 mm².
Although this may explain the need for an AD103 straight away, with AD102 targeting only markets with >$1,500 RRPs.
 

Good point. A humongous AD102 would explain the need for AD103 at launch.
 
AMD's Infinity Cache curves for hit rate versus cache size are shown for 2MP, 4MP and 8MP render targets (roughly 1080p, 1440p and 4K). When doubling or more the performance of a GPU, the curve for 2MP gaming no longer works: the curve for which the 2MP card was specified is too optimistic for a replacement card that's twice as fast. The same applies at 4MP and 8MP.

AMD isn't going to try to make 128MB of cache work for Navi 31. It's going for seemingly at least double that, 256MB.

NVidia, aiming for 96MB in the same performance category, looks problematic. The Semi Analysis article suggests that NVidia's L2 will run at around 6TB/s, a speed that could be faster than Infinity Cache in Navi 31. So, for hits AD102 will have an advantage (higher bandwidth coupled with a shallower cache hierarchy, meaning lower latency). But the hit rate in the 8MP gaming category will be far lower than the 53% shown in the article, because there will be more than twice as much work being done concurrently by the GPU.
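As a toy model of why the hit rate matters so much here (all figures are the rumoured/estimated ones, and the model ignores latency entirely): with hit rate h, only (1 − h) of traffic reaches VRAM, so sustainable bandwidth is amplified by 1 / (1 − h), capped by the cache's own bandwidth.

```python
# Toy effective-bandwidth model; every number is a rumour or estimate.
def effective_bw(vram_gbs, cache_gbs, hit_rate):
    # With hit rate h, only (1 - h) of requests reach VRAM, so deliverable
    # bandwidth is vram_gbs / (1 - h), capped by the cache's own bandwidth.
    return min(vram_gbs / (1.0 - hit_rate), cache_gbs)

vram = 24 * 384 / 8   # 24 Gbps G6X on a 384-bit bus -> 1152 GB/s
l2 = 6000             # ~6 TB/s L2, per the Semi Analysis estimate

print(effective_bw(vram, l2, 0.53))  # ~2451 GB/s at the article's 53% hit rate
print(effective_bw(vram, l2, 0.35))  # ~1772 GB/s if the hit rate sags
```

In this crude model a drop from 53% to 35% hit rate costs roughly a quarter of the effective bandwidth, which is the worry being raised.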
 
Yeah, but they're also using faster VRAM in the form of 24 Gbps G6X. So it may be a case where they miss more often but the miss penalty is lower.
 
It would be interesting to see how latency varies with the count of memory channels and the bandwidth of those channels.

It may be that having more memory channels is the more advantageous factor.

EDIT: This article seems to suggest that additional memory channels bring no latency benefit in GA102:

Measuring GPU Memory Latency – Chips and Cheese

and that overall Navi 21 and GA102 have the same memory latency.
 
NVidia, aiming for 96MB in the same performance category, looks problematic. The Semi Analysis article suggests that NVidia's L2 will run at around 6TB/s, a speed that could be faster than Infinity Cache in Navi 31. So, for hits AD102 will have an advantage (higher bandwidth coupled with a shallower cache hierarchy, meaning lower latency). But the hit rate in the 8MP gaming category will be far lower than the 53% shown in the article, because there will be more than twice as much work being done concurrently by the GPU.

That assumes AMD's hit-rate stats are relevant for Nvidia's memory pipeline. There are other factors that influence hit rates aside from size: pinning, compression, prefetching, eviction policies, etc. Nvidia also has caches local to the RT cores that help reduce the load on L2.

It makes more sense to compare AD102’s 96MB to the 6MB of GA102.
 
Almost back to Maxwell, if that's true. Now just to coalesce pairs of subcores, combining SIMD width and bringing back the VLIW2 scheduler.
 
The picture doesn’t really make sense. The ratio of control to compute seems unbalanced and each sub core having just a single 16-wide SIMD doesn’t increase peak SM throughput.

In a follow-up tweet he’s saying up to 200TF which doesn’t match the picture. Silly season is in full swing.
 
So the article with the updates on latency measurements doesn't seem to change the picture much in terms of the latency of global memory on Navi 21 and GA102. They're both about the same.

I think these articles are biased towards identifying best-case latency figures - which is kind of the opposite of the "worst case" that I was originally thinking about. So I'm not convinced it's that much use.

That assumes AMD's hit-rate stats are relevant for Nvidia's memory pipeline. There are other factors that influence hit rates aside from size: pinning, compression, prefetching, eviction policies, etc. Nvidia also has caches local to the RT cores that help reduce the load on L2.

It makes more sense to compare AD102’s 96MB to the 6MB of GA102.
There's no doubt that more cache will be better. I'm merely questioning whether it's enough.

The fact that estimates put the die size at around 611 mm² implies to me that NVidia's performance models show no need for more cache. Honestly, I'm still surprised. Cache is effectively free die space, as it offers fine-grained redundancy and so shouldn't affect yields.

Nvidia's semi-tiled rasterisation approach should also do wonders for the L2$ hit rate.
That's double-counting though, as it's something that Ampere already does.

The second article I linked regarding latency starts off with an analysis of Unigine Superposition "8K Optimized". The analysis concludes that latency is the problem that Ampere has:

"Loosely translated, Nvidia is saying that if the shaders are using less than 80% of their issue bandwidth while they have work to do, and the top stall reason is because warps were waiting on cache/memory, then you’re limited by cache/memory latency."

Is that purely texturing related?

A single SIMD for FP32/Int32 certainly solves the primary utilisation problems that Ampere's dual-SIMD layout introduces. It would also save a fair amount of die space.

The extra register files and scheduling/scoreboarding obviously eats in to the space saving, but it should still come out far ahead in terms of utilisation.

I like this speculation, which is supposedly founded upon stuff that can't be revealed.

The picture doesn’t really make sense. The ratio of control to compute seems unbalanced and each sub core having just a single 16-wide SIMD doesn’t increase peak SM throughput.
Within the SM there's twice as many of these partitions.

It's worth remembering that the register file in Ampere doesn't have the bandwidth to support, for example, FMA on both FP32 pipes simultaneously. So even if the instruction mix in the kernel would allow for two independent FMAs to be issued "simultaneously" the register file will say no. The partition then depends upon result forwarding, kept in the operand collector or somewhere else, to avoid the register file bandwidth crunch.
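For reference, the operand arithmetic behind that concern (a sketch; Ampere's actual register-file port counts aren't public, so this only counts what dual issue would demand, not what the hardware provides):

```python
# Operand reads demanded by dual-issuing FMA to both 16-wide FP32 pipes.
reads_per_fma = 3   # d = a * b + c has three source operands
simd_width = 16     # per-pipe SIMD width in an Ampere SM partition
pipes = 2           # the two FP32 pipes in one partition

reads_per_clock = pipes * reads_per_fma * simd_width
print(reads_per_clock)  # 96 32-bit operand reads per partition per clock
```

Operand collectors and result forwarding can cut the reads that actually hit the register file, which is why this is hard to settle from the outside.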
 
Within the SM there's twice as many of these partitions.
Yes but each is only 16-wide.

It's worth remembering that the register file in Ampere doesn't have the bandwidth to support, for example, FMA on both FP32 pipes simultaneously. So even if the instruction mix in the kernel would allow for two independent FMAs to be issued "simultaneously" the register file will say no. The partition then depends upon result forwarding, kept in the operand collector or somewhere else, to avoid the register file bandwidth crunch.

What makes you say that? Ampere needs to issue a single 32-wide warp per clock to keep both 16-wide SIMDs busy and it certainly has enough register bandwidth to do that.
 
Yes but each is only 16-wide.
Yes, so the peak FP32 throughput is unchanged per SM. As you observed earlier, it's not greater, but also, it's not less. It's still in agreement with various interpretations of leaks/rumours, 18,432 FP32 lanes.

And average throughput should be higher simply because of reduced instruction-dependency problems.
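A quick sanity check on the lane count under the rumoured layout (Ampere's 4 partitions with two 16-wide FP32 pipes each, versus the rumoured 8 partitions with one 16-wide pipe each; the 144-SM figure is also rumour):

```python
width = 16
ampere_lanes = 4 * 2 * width    # 4 partitions, two 16-wide FP32 pipes each
rumoured_lanes = 8 * 1 * width  # 8 partitions, one 16-wide pipe each
assert ampere_lanes == rumoured_lanes == 128  # peak per SM is unchanged

sm_count = 144  # rumoured full AD102
print(sm_count * rumoured_lanes)  # 18432 FP32 lanes
```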

What makes you say that? Ampere needs to issue a single 32-wide warp per clock to keep both 16-wide SIMDs busy and it certainly has enough register bandwidth to do that.
I've failed to find the document that I believed described the problem, so for the time being I'll agree with you that this is irrelevant.
 
That new rumor from Kopite makes zero sense to me.
It's basically a regression back to Turing, and it's unbalanced as hell in tensor and RT capabilities.
Loads of flops don't exactly help Ampere much, so why would they help Lovelace?
 