AMD: Navi Speculation, Rumours and Discussion [2019-2020]

WCCFTech was the first place where I saw 40 CUs for the initial Navi chip, but I generally don't visit the more obscure rumor/leak sites.

That said, I'm not sure how "guessable" 40 CUs were before the WCCFTech rumor last November. Due to the discreteness of CU counts, there aren't too many possibilities for a midrange Navi in 2019, and one list of reasonable possibilities for the CU count is 32, 36, 40, …, 60, 64 (although this list is unavoidably shaped by hindsight). If you randomly picked a CU count from the above list you would have an 11% chance of getting it right, assuming the real CU count was in the list to begin with. But someone who is guessing may consider some CU counts more likely than others. If you had asked me a year ago what my prediction for Navi CUs was, I would have predicted something closer to 64 and dismissed 40 as unlikely.
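For what it's worth, the uniform-guess figure works out like this (purely illustrative, and it assumes the real count really is in that list with every entry equally likely):

Code:
# Odds of a uniform random guess over the plausible CU counts listed above.
candidates = list(range(32, 65, 4))   # [32, 36, 40, ..., 60, 64] -> 9 options
p_uniform = 1 / len(candidates)
print(candidates)
print(f"Uniform-guess hit rate: {p_uniform:.1%}")  # ~11.1%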

Too specific indeed, but they got the code wrong, calling it Navi12. Perhaps the AdoredTV 'leak' swayed them that way. They also got right that Navi would be a new uarch, which not many were expecting until AMD put it out as RDNA. They say the internal code was KUMA.

But they also claim it would be Vega56-level in performance and that no 7nm Vega would show up for gamers. Amusingly, one of the top comments is about getting 2070 performance from AMD.

There's this bit about 'Navi10', which imo is Navi12

Navi 10 has either been scrapped or will follow later sometime in late 2019 or early 2020, depending on a couple of factors. The performance level of this part will be equivalent to Vega and it will be a small GPU based on 7nm.

Doesn't make much sense if the other Navi chip was to slot into the Vega56 position.

According to Komachi, Navi14 is a low-end chip, so Navi12 could be a bigger chip, if not a small die too, and if it hasn't been scrapped.


I'm also curious if AMD can improve the density of their chips while keeping the clocks the same. Perhaps console chips can be denser and do lower clocks? And of course, whether AMD are able to do 3 SEs or not?
 
I'm also curious if AMD can improve the density of their chips while keeping the clocks the same. Perhaps console chips can be denser and do lower clocks?
I’ve thought a bit about this one, and the answer to the first question is probably that they can’t do a lot about density at the same clocks without either going to another lithographic variation of 7nm (such as using EUV with SDB) or, for instance, changing memory type and thus memory controllers.
Whether AMD can increase density if they are prepared to lower frequencies: well, yes! But how much could they gain in density, and how much would they lose in frequency?

Apple, for instance, have a transistor density of 82.8 million/mm2 in their A12 SoC, as opposed to 26.4 in their second-gen 16nm FF A10.
AMD’s new Navi die has a transistor density of 41 million/mm2 (whereas the Vega10 die had a transistor density of 25.7 million/mm2 on GF 14nm). So while we are comparing mittens and gloves here, it does seem to suggest that the mobile variant is significantly denser, and that the HP scaling is a lot less. There are a number of reasons why HP designs should have a harder time scaling in density, but the specifics of just how much you would gain in density by making sacrifices in frequency are beyond this armchair expert.
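For a rough sense of scale, the node-to-node ratios implied by those quoted densities (million transistors per mm2) look like this:

Code:
# Rough density-scaling comparison from the figures quoted above (MTr/mm^2).
densities = {
    "A10 (16nm FF)":    26.4,
    "A12 (7nm)":        82.8,
    "Vega10 (GF 14nm)": 25.7,
    "Navi10 (7nm)":     41.0,
}
mobile_scaling = densities["A12 (7nm)"] / densities["A10 (16nm FF)"]
hp_scaling = densities["Navi10 (7nm)"] / densities["Vega10 (GF 14nm)"]
print(f"Mobile SoC density jump: {mobile_scaling:.2f}x")  # ~3.1x
print(f"HP GPU density jump:     {hp_scaling:.2f}x")      # ~1.6x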

I’ll say this though. The enthusiast desktop GPU market is where it really pays to push your silicon to its bleeding edge. The more performance you squeeze out, the more money you can ask for your product; observe how much more Nvidia can demand for the 30% or so in performance that the 2080Ti has over the 2070Super.
And you don’t have to have any concerns about power supplies, or even coolers and noise - all of that is pushed onto end users and, to some extent, partner manufacturers. Whereas in consoles, the costs and consequences of needing larger power supplies, more expensive cooling, higher demands on ventilated placement, higher noise levels, and consequently disgruntled customers who suffer failures from trying to shut noisy devices away, all fall squarely on the shoulders of the party, Sony or MS, that commissioned the design of the chip. It wouldn’t be surprising if a somewhat different compromise in size/density/frequency/power was struck.
But again, what that would mean exactly is not clear.
 
It's not going to be just 10% under Vega 20. AMD showed their absolute best result with their weird Strange Brigade choice (which is not even a popular AA game); expect much worse real-world results than this.
Well, the NDA is lifted, and the Radeon RX 5700 XT offers 94% of the performance of Vega 20 / Radeon VII:
https://www.computerbase.de/2019-07/radeon-rx-5700-xt-test/2/#abschnitt_benchmarks_in_2560__1440

Good luck next time. But I'm afraid that the audience for your purely negative "speculations" will be reduced significantly.
 
Any confirmation on the amount of LDS per CU/WGP? I can only find AMD slides that don't have numbers for it.
Some of the compute results seemed far off enough that I was curious if there's a wavefront occupancy issue with the LDS, or perhaps iffy register allocation choices due to the banked register file needing optimization.
More optimized games that use intrinsics may see regressions, depending on how they detect whether a given GPU can use them, or whether they apply them on a GFX10 GPU whose instruction behaviors may be subtly different.
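To illustrate the detection concern, here is a purely hypothetical sketch; the function, family names, and whitelist are made up for illustration and don't correspond to any real game or vendor library:

Code:
# Hypothetical sketch of permissive vs. whitelist-based intrinsic enabling.
# All names here are illustrative only, not a real API.
KNOWN_INTRINSIC_SAFE = {"gfx8", "gfx9"}  # generations the shader paths were validated on

def use_vendor_intrinsics(vendor: str, gfx_family: str, whitelist_only: bool) -> bool:
    if vendor != "AMD":
        return False
    if whitelist_only:
        # Conservative: only enable on families the intrinsic paths were tuned for.
        return gfx_family in KNOWN_INTRINSIC_SAFE
    # Permissive: enable on any AMD GPU, including a GFX10 part whose
    # instruction behavior may be subtly different -- the potential regression case.
    return True

print(use_vendor_intrinsics("AMD", "gfx10", whitelist_only=True))   # False
print(use_vendor_intrinsics("AMD", "gfx10", whitelist_only=False))  # True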
 
AFAIK, it's unchanged per CU, and two LDSes can be combined at the WGP level. I don't have anything in writing on it though, and I did not ask very specifically, i.e. there might be implications as to latency or occupancy between exclusive and inclusive modes.
 
Any confirmation on the amount of LDS per CU/WGP? I can only find AMD slides that don't have numbers for it.

They did specify 256 KB VGPR per full execution unit (64-lane wavefront), which is the same total size as in GCN (64KB x 4 SIMD16 vector units). This makes it twice as big per each SIMD32 vector unit (i.e. 128 KB x 2 vector units).

Slide 11 presents RDNA-era shader unit as "4 Scalar/SIMD32/SFU8" - this is actually RDNA "workgroup processor" (WGP), a group of 2 CUs (exactly 4 each of Scalar, SIMD32 and SFU8 blocks in total).

Then there is slide 12 in the same presentation, where they compare x-rays of the RDNA WGP/CU with the GCN CU. On this slide, the RDNA LDS is pictured as shared across a WGP of 2 CUs, and its area is about twice as large as the VGPR in each SIMD32 unit - which per the above should be 128 KB. By contrast, the GCN LDS has the same area as a single SIMD16 VGPR block, which is 64 KB.


So I would say that the LDS should be 4 times the total size of a GCN CU's LDS - i.e. 128 KB per CU (with 2 SIMD32 vector units) and 256 KB per WGP (2 CUs and 4 SIMD32 vector units).
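Spelling out the area-ratio arithmetic behind that reading of the slide (this just restates the inference above; it is not a confirmed spec):

Code:
# Restating the slide-based inference above as arithmetic (not a confirmed spec).
gcn_vgpr_per_simd16 = 64    # KB, x4 SIMD16 per GCN CU  -> 256 KB VGPR per CU
rdna_vgpr_per_simd32 = 128  # KB, x2 SIMD32 per RDNA CU -> 256 KB VGPR per CU

# Slide 12 area reading: GCN LDS ~ one SIMD16 VGPR block,
# RDNA LDS (shared per WGP) ~ two SIMD32 VGPR blocks.
gcn_lds_per_cu = 1 * gcn_vgpr_per_simd16     # 64 KB
rdna_lds_per_wgp = 2 * rdna_vgpr_per_simd32  # 256 KB
rdna_lds_per_cu = rdna_lds_per_wgp // 2      # 128 KB

print(gcn_lds_per_cu, rdna_lds_per_cu, rdna_lds_per_wgp)
print(rdna_lds_per_wgp / gcn_lds_per_cu)  # 4.0 -> the "4x a GCN CU's LDS" reading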
 
They did specify 256 KB VGPR per full execution unit (64-lane wavefront), which is the same total size as in GCN (64KB x 4 SIMD16 vector units). This makes it twice as big per each SIMD32 vector unit (i.e. 128 KB x 2 vector units).
In the context of the poor compute performance, I was curious if the early support for Navi's banked register file could have failed to spread register references across all banks.
An equivalent set of register IDs allocated in a GCN shader and an RDNA shader can experience more stalls in the latter due to bank conflicts, and some additional loss can occur if the compiler didn't handle the longer dependent-instruction latency well.
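As a toy illustration of the kind of allocation issue I mean (the bank count and the reg_id-mod-N mapping here are pure assumptions for the sake of the example, not AMD's actual scheme):

Code:
# Toy model of VGPR bank conflicts: assumes a hypothetical 4-bank register file
# with bank = reg_id % 4. Not AMD's actual banking scheme, just an illustration
# of how a naive allocator can pile one instruction's operands onto a single bank.
N_BANKS = 4

def worst_bank_load(operand_regs):
    """Return how many of an instruction's operands land on the most-loaded bank."""
    counts = [0] * N_BANKS
    for reg in operand_regs:
        counts[reg % N_BANKS] += 1
    return max(counts)

# A three-source instruction (e.g. an FMA reading v0, v4, v8):
naive_alloc = [0, 4, 8]    # all map to bank 0 -> operand reads get serialized
spread_alloc = [0, 5, 10]  # banks 0, 1, 2     -> reads can proceed in parallel

print(worst_bank_load(naive_alloc))   # 3
print(worst_bank_load(spread_alloc))  # 1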

Then there is slide 12 in the same presentation, where they compare x-rays of the RDNA WGP/CU with the GCN CU. On this slide, the RDNA LDS is pictured as shared across a WGP of 2 CUs, and its area is about twice as large as the VGPR in each SIMD32 unit - which per the above should be 128 KB. By contrast, the GCN LDS has the same area as a single SIMD16 VGPR block, which is 64 KB.

So I would say that the LDS should be 4 times the total size of a GCN CU's LDS - i.e. 128 KB per CU (with 2 SIMD32 vector units) and 256 KB per WGP (2 CUs and 4 SIMD32 vector units).
This conflicts with the impression given to CarstenS, and there is some risk in using a conceptual diagram to guess at physical dimensions. Part of the reason certain items are shown stretched to twice the width is that those items are shared. However, it's not a given that they're actually that size when shared, or that items like the scheduler and scalar unit blocks shrank as much as the diagram shows just because the artist didn't need to stretch them across it.

Perhaps further details will come out if a white paper or ISA doc is released.

That diagram is also the one that I noted showed the branch/message block going missing in the GCN to RDNA transition without comment.
 
Good catch. It's the same in SiSoftware Sandra. Probably has something to do with the new CU grouping in RDNA. In GCN up to four Compute Units used to share the vector instruction and scalar caches. RDNA groups CUs by two, and LDS is now also a shared resource. It may well be that apart from those changes CUs in RDNA ended up sharing some of the front-end logic involved in instruction fetching and decoding.
 
Good catch. It's the same in SiSoftware Sandra. Probably has something to do with the new CU grouping in RDNA. In GCN up to four Compute Units used to share the vector instruction and scalar caches. RDNA groups CUs by two, and LDS is now also a shared resource. It may well be that apart from those changes CUs in RDNA ended up sharing some of the front-end logic involved in instruction fetching and decoding.
The RDNA diagrams indicate that there is a shared scalar cache and a shared instruction cache. A slide also indicated that, unlike the shared decode and issue units in GCN, there are separate decode and issue units for the vector path and for the scalar path.
No mention was made for the other instruction types, although the vector memory path could readily have duplicate decode/issue since there are two texture blocks and L0 caches.

The LDS is shared, so what that means in terms of decode and issue is uncertain.
The export, message, scalar memory, and other types are not mentioned and some may be more suitable to having shared decode and issue. There's still the one scalar cache, and elements like the export path are arbitrating for a common bus that wouldn't allow independent instruction issue anyway.
 
24 Navi CUs sounds like a Polaris 11 replacement.
Or rather a Polaris 11 sized chip with Polaris 10 performance.

Could be a very decent chip for laptops.

I'm surprised by the CU count. Underclocking would have to produce huge power efficiency gains for that many to run on most laptops, unless this is some gaming laptop specific GPU.
 
I'm surprised by the CU count. Underclocking would have to produce huge power efficiency gains for that many to run on most laptops, unless this is some gaming laptop specific GPU.
$1500 gaming laptops use the TU106, which the 40 CU Navi 10 competes with in graphics cards at similar performance and power consumption.
A 24 CU Navi would proportionally be closer to a TU116, which goes into $1000 laptops.

Though this could end up being exclusive to the rumored 16" MacBook Pro.
 

That's a lot of codenames.

If AMD can easily scale their "double CU" count inside each block, which it seems like they can far more easily than they used to be able to with GCN, then we've little idea of what CU count each of these could even have. E.g., maybe Navi21 is a differently scaled Navi 20?

Either way, I do wonder if they can get more performance per mm2 by scaling up CU count vs cache/bandwidth/primitive shader engines. RDNA does catch up to Nvidia in titles Nvidia used to dominate, seemingly geometry-heavy stuff like GTAV/Total War. But in some compute-heavy stuff like Rainbow Six Siege it falls a bit behind relative to its other performance scaling. Maybe putting six double CUs per block rather than five would be an overall win, or it could even scale to seven per block if you want high-end 4k performance. Samsung already produces 16 Gbps GDDR6, roughly a 15% overclock from Navi 10's memory, which could in turn accommodate 6 double CUs per block over 5, if bandwidth is a bottleneck.
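Rough numbers for that trade (this assumes Navi 10 ships with 14 Gbps GDDR6, which is my assumption, compared against Samsung's 16 Gbps parts on the same 256-bit bus):

Code:
# Rough scaling of a hypothetical 6-double-CU-per-block Navi vs. Navi 10's 5,
# paired with 16 Gbps GDDR6. Assumes Navi 10 uses 14 Gbps GDDR6 on a 256-bit bus.
cu_scale = 6 / 5     # 1.20x ALUs per block at equal clocks
mem_scale = 16 / 14  # ~1.14x bandwidth on the same bus width
print(f"CU/ALU scaling:    {cu_scale:.2f}x")
print(f"Bandwidth scaling: {mem_scale:.2f}x")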
 
I'm not sold on RDNA's supposed improvement over GCN until computerbase do a comparison of the 5700XT against the 390 instead of every other chip in existence.
 
I'm not sold on RDNA's supposed improvement over GCN until computerbase do a comparison of the 5700XT against the 390 instead of every other chip in existence.

Just curious, why the 390? The RX480/580/590 has a similar number of CUs and generally outperforms the 390, doesn't it?
 
40 was easy to guess at; AMD had 20 CU single "Shader Engine" GPUs last year, already a change over their 16 CU ones. That this change would extend to Navi was easy to guess, since they had to be doing it for some reason.

In fact, looking at the 6 double CU count per "shader engine" (whatever that means in RDNA) for Navi 14 or whatever it is, I'd easily guess any refresh/new chips will probably hit that too. Looking at 4k and 1440p results for Navi 10, that GPU actually drops relative to Nvidia's performance when going up in resolution. Meaning, most likely they're hitting compute/bandwidth bottlenecks rather than geometry or work distribution bottlenecks. So moving each "SE" to 6 double compute units (from 5) and faster GDDR6 (if bandwidth is bottlenecked anywhere) would give a good boost to 4k benchmark results compared to the relatively small increase in die size it'd incur.

It'd also mean that adding just 2 more SEs, and a 384-bit bus in total, would compete with a 2080Ti at 4k (at least at current clocks). Which is very good for retail, as it'd need to curtail the clock speed much less than a theoretical 80 CU Navi card and could compete at a smaller die size.
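Putting numbers on that bigger hypothetical part (taking the framing above of Navi 10 as 4 engines of 5 double CUs; everything here is speculative):

Code:
# Hypothetical bigger Navi vs. Navi 10, using the framing above:
# Navi 10 = 4 engines x 5 double CUs x 2 CUs = 40 CUs on a 256-bit bus.
navi10_cus = 4 * 5 * 2  # 40
big_cus = 6 * 6 * 2     # 6 engines x 6 double CUs -> 72 CUs
bus_scale = 384 / 256   # 1.5x bus width (and bandwidth at the same data rate)
print(f"{big_cus} CUs, {big_cus / navi10_cus:.2f}x the CU count, {bus_scale:.2f}x the bus")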
 
Just curious, why the 390? The RX480/580/590 has a similar number of CUs and generally outperforms the 390, doesn't it?

Because the 390 had 64 ROPs and Polaris needed higher clocks. So the per-FLOPS comparison looks better for Navi vs. Polaris than vs. Hawaii.
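For reference, rough peak-FLOPS numbers behind that comparison (the clocks are approximate boost/game clocks from memory, so treat them as ballpark only):

Code:
# Ballpark peak FP32 throughput: 2 ops/clk * 64 lanes * CUs * clock (GHz) = TFLOPS.
# Clocks are approximate boost/game clocks, quoted from memory.
def tflops(cus, ghz):
    return 2 * 64 * cus * ghz / 1000

print(f"R9 390  (40 CU @ ~1.00 GHz): {tflops(40, 1.00):.1f} TF")  # ~5.1
print(f"RX 580  (36 CU @ ~1.34 GHz): {tflops(36, 1.34):.1f} TF")  # ~6.2
print(f"5700 XT (40 CU @ ~1.75 GHz): {tflops(40, 1.75):.1f} TF")  # ~9.0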

40 was easy to guess at; AMD had 20 CU single "Shader Engine" GPUs last year, already a change over their 16 CU ones. That this change would extend to Navi was easy to guess, since they had to be doing it for some reason.

In fact, looking at the 6 double CU count per "shader engine" (whatever that means in RDNA) for Navi 14 or whatever it is, I'd easily guess any refresh/new chips will probably hit that too. Looking at 4k and 1440p results for Navi 10, that GPU actually drops relative to Nvidia's performance when going up in resolution. Meaning, most likely they're hitting compute/bandwidth bottlenecks rather than geometry or work distribution bottlenecks. So moving each "SE" to 6 double compute units (from 5) and faster GDDR6 (if bandwidth is bottlenecked anywhere) would give a good boost to 4k benchmark results compared to the relatively small increase in die size it'd incur.

It'd also mean that adding just 2 more SEs, and a 384-bit bus in total, would compete with a 2080Ti at 4k (at least at current clocks). Which is very good for retail, as it'd need to curtail the clock speed much less than a theoretical 80 CU Navi card and could compete at a smaller die size.

40 CUs being the first Navi chip as well certainly looks like they had a decent source. I didn't pay much attention to that single shader engine chip, any pointers?

The reversal of AMD doing better at higher resolutions is certainly interesting; maybe bandwidth and the abnormally low pixel fillrate noted in the AT review are the cause.
 