AMD: Navi Speculation, Rumours and Discussion [2019-2020]

Yeah, but even then, the cut-down chips usually give up at most 20% of the performance, which still leaves a massive gap between the tiers. A cut-down 80 CU die would probably still be 72 CUs or close to it, which is still massively above a 40 CU part.
The top-end card could already be 72 CUs, with a cut-down one at 64.
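Back-of-the-envelope (assuming performance scales more or less linearly with CU count at the same clocks, which it won't exactly):

40 / 80 = 0.50, and 20 / 40 = 0.50, so each full tier is nominally half the shader throughput of the one above
72 / 80 = 0.90, so a salvage part of the big die only gives up ~10% of its CUs

which is why the gaps between dies dwarf the gap between a full and a cut-down part of the same die.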
 
With the 3 Navi2 GPUs, AMD seems to have a lot of space between their tiers.

Navi 21 with 80 CUs, then Navi 22 with 40 CUs, then Navi 23 with 20(?) CUs.

This is a lot of performance difference between the GPU tiers.

With only 3 dies I'd expect a fairly large discrepancy. It's no different from Ampere; the 3080 should be 50-70% faster than a 3070.

The bizarre claims about the bus width make even less sense now, though. If there's some magic eDRAM for "Big Navi", why wouldn't it be used for the smaller dies as well, just scaled down accordingly?
 
I ran across a link from Anandtech's forum that points to a paper by the individuals involved in the patent concerning this sort of scheme:
https://adwaitjog.github.io/docs/pdf/sharedl1-pact20.pdf

[...]

edit: Actually, I just ran across a mention of the coherence mechanism assumed by the analysis, and it's Nvidia's L1 flush at kernel or synchronization boundaries. That could significantly deviate from the expected behavior of a GCN/RDNA cache.
I think your criticism is on-point. AMD piggy-backs on research by people who normally work on CUDA.

Our baseline architecture assumes a generic GPU, consisting of multiple cores (also called Compute Units, or CUs) that have private local L1 caches. These caches are connected to multiple address-sliced L2 cache banks via a NoC
This also sounds like "GCN" to me, though the "sizes" remind me of NVidia.

It's an odd arrangement with the ROPs versus the L1. The RDNA whitepaper considers the ROPs clients of the L1 and touts how it reduces memory traffic. However, if their output is considered write-through, what savings would making them an L1 client bring?
Given screen-space tiling, a given tile will only ever be loaded by one RBE (no sharing), and unless the RBE loads a tile and then proceeds to write nothing to it, the L1 at best holds ROP data that is read once and must be evicted once it leaves the RBE caches (no reuse).
I guess this simply reflects the compression ratios: render target pixels consume way more cache space than compressed texels. Therefore the L1 is biased towards texels.

A tenet of RDNA is that L2s "only talk to L1s" (though L2s would also need to support clients such as the command processor and media units like h.264 decode). This means the only way the ROPs can see memory is via L1.

The whitepaper doesn't talk about writes to memory originating from compute kernels (unless I've missed or forgotten it), and these would have the same problem. Effectively, it seems, all writes from the shader engines are written through the L1 to the L2.

If L1 lines are "address-locked" (no replication anywhere in the GPU at the L1 level), then it would seem that compute kernels that scatter, or that do a lot of random read-write to memory, will benefit, as the L1 effectively grows in size, even if the latencies are more variable than they would be with replicated L1 lines. For these kernels, replicated L1 lines would, if they are re-used at all, tend to suffer from invalidations, i.e. the use:invalidation ratio would be "low".
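Roughly the kind of kernel I have in mind, as a quick sketch (CUDA only because it's the easiest way to write a self-contained GPU snippet; the kernel name and hashes are made up). With replicated L1 lines this access pattern thrashes each small private L1; with address-locked lines, the whole pool of L1 slices becomes usable capacity:

Code:
#include <cuda_runtime.h>

// Hypothetical gather/scatter kernel: each thread reads from and writes to
// effectively random locations in a large table, standing in for the
// "lots of random read-write to memory" case above.
__global__ void random_read_write(const float* __restrict__ in,
                                  float* out, unsigned tableSize)
{
    unsigned tid = blockIdx.x * blockDim.x + threadIdx.x;
    // Cheap integer hashes stand in for data-dependent indices.
    unsigned src = (tid * 2654435761u) % tableSize;
    unsigned dst = (tid * 40503u + 2531011u) % tableSize;
    out[dst] = in[src] * 2.0f;   // scattered read, then a scattered write
}

int main()
{
    const unsigned tableSize = 1u << 22;   // 4M floats, 16 MB
    float *in = nullptr, *out = nullptr;
    cudaMalloc(&in,  tableSize * sizeof(float));
    cudaMalloc(&out, tableSize * sizeof(float));
    cudaMemset(in, 0, tableSize * sizeof(float));
    random_read_write<<<tableSize / 256, 256>>>(in, out, tableSize);
    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(out);
    return 0;
}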
 
GL1 is bypassed for all CU atomic and write requests (ref: ISA documentation), and any affected GL1 line is invalidated (ref: whitepaper). So GL1 lines are brought in only by non-coherent read misses, and coherent ones would skip GL1 entirely.

Read-after-write performance does not seem like a primary design goal of the GCN/RDNA cache hierarchy after all. For example, a write can stay in L0 only when it dirties an entire cache line. So scattered reads and writes effectively end up being served by L2, even for a single wavefront, or for multiple wavefronts in the same workgroup (barrier synchronised).
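A minimal sketch of the workgroup-barrier case (CUDA-flavoured, names made up): each lane writes 4 bytes, far less than a full line, then reads a neighbouring lane's data after the barrier. Under the policy above, those post-barrier reads end up being served by L2:

Code:
#include <cuda_runtime.h>

// Hypothetical kernel: read-after-write within one workgroup, staged through
// global memory rather than LDS. The per-lane write never dirties a whole
// cache line, so it cannot linger in L0; the neighbour read after the barrier
// therefore has to come back from further out in the hierarchy (L2, in the
// scheme described above).
__global__ void raw_within_workgroup(float* scratch, float* out)
{
    unsigned gid = blockIdx.x * blockDim.x + threadIdx.x;

    scratch[gid] = (float)threadIdx.x;               // partial-line write

    __syncthreads();                                 // workgroup barrier

    unsigned neighbour = blockIdx.x * blockDim.x
                       + ((threadIdx.x + 1) % blockDim.x);
    out[gid] = scratch[neighbour];                   // read another lane's write
}
// (Host-side launch is the same boilerplate as in the earlier sketch.)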

Btw, these are the only two legit read-after-write scenarios that I know of, since the only sane sync primitive outside the workgroup is atomics (aside from graphics-related ones), which always skip GL1 in favour of L2. Write combining has always been the focus, and the only paper I've read that echoes this focus was QuickRelease (2014).
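And the outside-the-workgroup path, for contrast: synchronising through a global atomic, which as said always skips GL1 for L2. The classic "last block does the final pass" pattern, again with made-up names:

Code:
#include <cuda_runtime.h>

// Hypothetical producer/consumer across workgroups. Each block publishes one
// value, then bumps a global counter with an atomic; the last block to arrive
// knows all other writes are visible and consumes them. Per the ISA notes,
// the atomic itself would be serviced by L2, bypassing GL1.
__device__ unsigned blocksDone = 0;

__global__ void publish_then_reduce(float* partials, float* result,
                                    unsigned numBlocks)
{
    if (threadIdx.x == 0) {
        partials[blockIdx.x] = (float)blockIdx.x;    // publish this block's value
        __threadfence();                             // make it visible device-wide
        if (atomicAdd(&blocksDone, 1) + 1 == numBlocks) {
            float sum = 0.0f;                        // last block reads the rest
            for (unsigned i = 0; i < numBlocks; ++i)
                sum += partials[i];
            *result = sum;
        }
    }
}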

GL1 (aside from its L2 request arbitrator role) does seem to be mostly amplifying bandwidth for constant, texture and instruction loads.
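For completeness, the pattern GL1 does help, sketched the same way: lots of workgroups repeatedly reading the same small read-only table, so one L2 fill per GL1 can feed many L0 misses:

Code:
#include <cuda_runtime.h>

// Hypothetical lookup kernel: every CU in a shader array reads the same small,
// read-only table. Those lines are brought into GL1 once by non-coherent read
// misses and then re-served to many L0s, which is the bandwidth amplification
// role described above.
__global__ void shared_lut_lookup(const float* __restrict__ lut,     // small table
                                  const unsigned* __restrict__ keys,
                                  float* out, unsigned n, unsigned lutSize)
{
    unsigned tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        out[tid] = lut[keys[tid] % lutSize];   // many CUs hit the same few lines
}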
 
With only 3 dies I'd expect a fairly large discrepancy. It's no different from Ampere; the 3080 should be 50-70% faster than a 3070.

The bizarre claims about the bus width make even less sense now, though. If there's some magic eDRAM for "Big Navi", why wouldn't it be used for the smaller dies as well, just scaled down accordingly?

Did you mean 3060?
 
Did you mean 3060?

Nope

The 3080 has 70% more bandwidth than a 3070, which is stuck at 2080 levels, and the card is very much bandwidth-limited in certain titles like Flight Simulator. The 50% figure should be for titles that aren't like Doom Eternal.
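For reference, the ~70% comes straight from the announced memory configs, assuming the launch specs (320-bit / 19 Gbps GDDR6X on the 3080; 256-bit / 14 Gbps GDDR6 on the 3070 and 2080):

320 bit x 19 Gbps / 8 = 760 GB/s
256 bit x 14 Gbps / 8 = 448 GB/s
760 / 448 ≈ 1.70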

And I do wonder if the spread on the Navi cards will be quite similar. Depending on how fast the big one is, it could well be worse, if the big one ends up faster than a 3090 while the midrange is stuck somewhere near an overclocked 5700 XT.
 
Nope

The 3080 has 70% more bandwidth than a 3070, which is stuck at 2080 levels, and the card is very much bandwidth-limited in certain titles like Flight Simulator. The 50% figure should be for titles that aren't like Doom Eternal.

And I do wonder if the spread on the Navi cards will be quite similar. Depending on how fast the big one is, it could well be worse, if the big one ends up faster than a 3090 while the midrange is stuck somewhere near an overclocked 5700 XT.
Nvidia claimed the 3070 is faster than a 2080 Ti. I can't see a 3080 being 50-70% faster.
 
Nvidia claimed the 3070 is faster than a 2080 Ti.
Yes and no. Verbally, yes, it was claimed to be faster. But on Nvidia's own presentation slide the 3070 was actually "only" exactly as fast as the 2080 Ti. My interpretation would be that you have to include quite a few titles which use RTX to get the same average performance as the 2080 Ti...
 
Yes and no. Verbally, yes, it was claimed to be faster. But on Nvidia's own presentation slide the 3070 was actually "only" exactly as fast as the 2080 Ti. My interpretation would be that you have to include quite a few titles which use RTX to get the same average performance as the 2080 Ti...
I don't doubt the 2080 Ti being faster in various scenarios. It's highly unlikely the 3070 ends up around 2080/2080 Super performance levels, which is where it would need to fall for the 3080 to be 50-70% faster.
 
Nvidia claimed the 3070 is faster than a 2080 Ti. I can't see a 3080 being 50-70% faster.

Yes and no. Verbally, yes, it was claimed to be faster. But on Nvidia's own presentation slide the 3070 was actually "only" exactly as fast as the 2080 Ti. My interpretation would be that you have to include quite a few titles which use RTX to get the same average performance as the 2080 Ti...


Galax puts the 3070 slower than the 2080 Ti, but what does any of this have to do with Navi?
 
Nope

The 3080 has 70% more bandwidth than a 3070, which is stuck at 2080 levels, and the card is very much bandwidth-limited in certain titles like Flight Simulator. The 50% figure should be for titles that aren't like Doom Eternal.

And I do wonder if the spread on the Navi cards will be quite similar. Depending on how fast the big one is, it could well be worse, if the big one ends up faster than a 3090 while the midrange is stuck somewhere near an overclocked 5700 XT.

If by overclocked 5700 XT you mean ~20% faster, i.e. RTX 2080/Super levels, then sure.
 
This also sounds like "GCN" to me, though the "sizes" remind me of NVidia.
At the level of L1s with per-channel L2 slices, the architectures generally agree. However, I may have mis-remembered which GPU the paper's analysis resembled more. I think the 2x16 SIMD arrangement, register file, separate texture caches, and 16KB/48KB L1/scratchpad split might be closer to Fermi.

I guess this simply reflects the compression ratios: render target pixels consume way more cache space than compressed texels. Therefore the L1 is biased towards texels.
The L1 is described as being read-only, and elsewhere it was presented as only holding data decompressed from DCC. Since the RDNA whitepaper indicated the L1 would absorb a lot of their traffic, how much traffic can it really absorb if so much of the ROP traffic bypasses it?

A tenet of RDNA is that L2s "only talk to L1s" (though L2s would also need to support clients such as the command processor and media units like h.264 decode). This means the only way the ROPs can see memory is via L1.
Another presentation had the DCC hardware absorbing writes from the CUs and RBEs, with the RBEs writing to the compressor, which is attached to the L2 since the L1 is uncompressed.

GL1 (aside from its L2 request arbitrator role) does seem to be mostly amplifying bandwidth for constant, texture and instruction loads.
This would seemingly benefit the RBEs, but the major limiters described for them are related to write-heavy traffic. Is there still significant bandwidth amplification for ROP traffic with a read-only L1?
 
Another presentation had the DCC hardware absorbing writes from the CUs and RBEs, with the RBEs writing to the compressor, which is attached to the L2 since the L1 is uncompressed.

UAV writes maintaining DCC are now supported, so it's not only the RBEs.
 
This would seemingly benefit the RBEs, but the major limiters described for them are related to write-heavy traffic. Is there still significant bandwidth amplification for ROP traffic with a read-only L1?
In theory it could act as a victim cache for the RBE caches to exploit the temporal and spatial locality of exports. This seems possible especially since each RBE owns its own screen-space partitions, so cache coherency for render targets across shader engines is presumably a non-issue.
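As a toy illustration of why coherence shouldn't be needed (the real RDNA screen-space hash isn't public, so the mapping below is made up): ownership is a pure function of tile coordinates, so a given render-target line is only ever produced by one RBE.

Code:
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical tile -> RBE ownership map. Any such fixed, non-overlapping
// partition means an RBE (and whichever GL1 sits in front of it) never sees
// another shader engine's render-target lines, so there is nothing to keep
// coherent across engines.
__host__ __device__ unsigned owning_rbe(unsigned tileX, unsigned tileY,
                                        unsigned numRBEs)
{
    return (tileX ^ tileY) % numRBEs;   // made-up checkerboard-style hash
}

int main()
{
    const unsigned numRBEs = 4;
    for (unsigned y = 0; y < 4; ++y) {
        for (unsigned x = 0; x < 8; ++x)
            std::printf("%u ", owning_rbe(x, y, numRBEs));
        std::printf("\n");
    }
    return 0;
}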

The victim-cache idea would, however, assume that the GL1 cache controller supports cache policies exclusive to the RBEs, so that dirty lines from the RBE caches can stay in GL1 (write-allocate) while they are on their way to L2. This definitely doesn't exist on the CU side anyway, where the ISA makes it quite clear that all atomics and writes bypass GL1 (write no-allocate + invalidation).
 