AMD: Navi Speculation, Rumours and Discussion [2019-2020]

Status
Not open for further replies.
I'm trying to be realistic about the use of die area. The only alternative being rumoured for the huge missing area is "massive cache". You have something better? Or do you think it's a monster cache?
I'm thinking that if Navi 21 is actually more or less on par with GA102 - and all signs point to this, if only for the XTX config - then there must be something in addition to the rumored 256 bit bus and it's unlikely to be magic.
It's also possible that reaching higher clocks and improving power consumption could result in some die area increases.
But suggesting that they somehow crammed 80 WGPs - or even twice the SIMDs per WGP - while staying inside 300W and on the same process and reaching 2.4GHz clocks isn't realistic at all.
When you're saying that NV did it you're forgetting that Turing actually had these lanes already, they just weren't capable of FP32. So not the same situation.
 
It's also possible that reaching higher clocks and improving power consumption could result in some die area increases.
There's no sign of such in XSX die though. PS5 die (based on approximate area) shows no sign of a gross die size increase versus Navi 10 or Navi 14.

There's 30%+ die area that can't be explained!

But suggesting that they somehow crammed 80 WGPs - or even twice the SIMDs per WGP - while staying inside 300W and on the same process and reaching 2.4GHz clocks isn't realistic at all.
The XT's rumoured base clock is 1500MHz or slower!


When you're saying that NV did it you're forgetting that Turing actually had these lanes already, they just weren't capable of FP32. So not the same situation.
Well the 52% increase in transistor count for 17% more SMs and ROPs (1 extra GPC) needs to be explained... Sure, transistor counts can't translate directly across foundries, let alone nodes, but 52% is a different ballgame.
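For what it's worth, the arithmetic behind that 52% checks out against the public figures (18.6B transistors / 72 SMs for TU102, 28.3B / 84 for GA102):

```python
# Public transistor and SM counts for TU102 (Turing) vs GA102 (Ampere).
tu102_transistors = 18.6e9
ga102_transistors = 28.3e9
tu102_sms, ga102_sms = 72, 84

transistor_growth = ga102_transistors / tu102_transistors - 1
sm_growth = ga102_sms / tu102_sms - 1

print(f"transistors: +{transistor_growth:.0%}")  # -> +52%
print(f"SMs:         +{sm_growth:.0%}")          # -> +17%
```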
 
There's no sign of such in XSX die though. PS5 die (based on approximate area) shows no sign of a gross die size increase versus Navi 10 or Navi 14.

There's 30%+ die area that can't be explained!
The XSX GPU seems to be a mash-up of RDNA1 and RDNA2 - or, to be more precise, RDNA2 hardware configured similarly to RDNA1. I don't know how much we can infer about PC RDNA2 parts from the XSX die.
And there is a clear explanation if Navi 21 does in fact come with a 256-bit G6 bus. If the cache rumors are true it could be a very interesting design which will scale differently to what we may assume from console and RDNA1 parts.

The XT's rumoured base clock is 1500MHz or slower!
Base clock is for "power viruses" like Furmark, never to be used in real world workloads. Game clock of 2-2.1 GHz will actually be higher than that of Navi 10.

Well the 52% increase in transistor count for 17% more SMs and ROPs (1 extra GPC) needs to be explained... Sure, transistor counts can't translate directly across foundries, let alone nodes, but 52% is a different ballgame.
There are many changes in Ampere besides the new SIMD configuration. FP32 ALUs should be more complex than INT32 ones, and they are likely in addition to the latter; these 52% are probably distributed across everything new in Ampere - new caches, new register file, new RT cores, new tensor cores, new memory controllers, new ROPs, etc.
 
The XSX GPU seems to be a mash-up of RDNA1 and RDNA2 - or, to be more precise, RDNA2 hardware configured similarly to RDNA1. I don't know how much we can infer about PC RDNA2 parts from the XSX die.
I agree, generally XSX (and PS5) are problematic here because, like with their use of Zen 2 not Zen 3, they should be assumed to be "0.x% of RDNA 2". But for clocks, specifically, PS5 shows no signs of a substantial change in die area beyond what's seen in Navi 10/14, despite being much faster.

And there is a clear explanation if Navi 21 does in fact come with a 256-bit G6 bus. If the cache rumors are true it could be a very interesting design which will scale differently to what we may assume from console and RDNA1 parts.
Yes, I agree, if this is a monster cache, then performance scaling analysis is a new mystery. Even if it's not a monster cache, it will be a new mystery. Similar to the "tiled" rasterisation that Maxwell brought, which was a radical improvement.

Navi 23's die size discrepancy is about half of Navi 21's. If it's a single shader engine with 32 ROPs (even with doubled zixel-rate) on a 128-bit bus targeted at 1080p gaming, that leaves us contemplating a 64MB last level cache, if Navi 21's is 128MB.

Base clock is for "power viruses" like Furmark, never to be used in real world workloads. Game clock of 2-2.1 GHz will actually be higher than that of Navi 10.
So GA102 in sustained maximum FP32 compute runs at much higher than 1400MHz (ish)? I can't find any data on this...

There are many changes in Ampere besides the new SIMD configuration. FP32 ALUs should be more complex than INT32 ones, and they are likely in addition to the latter; these 52% are probably distributed across everything new in Ampere - new caches, new register file, new RT cores, new tensor cores, new memory controllers, new ROPs, etc.
The items in your list, all added together, are a small percentage of the entire die.
 
80 WGPs with 4x SIMD-32s vs 84 SMs with 8x SIMD-16s. Hmm...

Are you saying that 4 TMU lanes in an SM versus 8 TMU lanes in a WGP is a major factor here? Is there something else? I can't read your mind.

Let’s look at it another way. TU102 -> GA102 required 50% more transistors for 17% more SMs, same bus width and a few tweaks to existing functionality. Maybe GDDR6X took a chunk of that. Or the new FP32 data path was actually not that cheap. But let’s assume it is.

We know RDNA 2 is introducing raytracing hardware and promises much greater power efficiency. I think it’s fair to assume RDNA2 CUs will grow significantly just for those 2 items alone.

Is it really realistic to quadruple the number of these beefier CUs on the same process while keeping power in check? If I had to guess, your area estimates are just off or AMD spent transistors somewhere obvious like a wider memory bus.
 
So GA102 in sustained maximum FP32 compute runs at much higher than 1400MHz (ish)? I can't find any data on this...

 
My current hypothesis is that the widening gap between base clocks and boost/game clocks is related to the fact that in most 'legacy' games the RT hardware is completely clock gated and powered down, so the entire TDP/TGP envelope can be spent on keeping the clocks up for the other functional units. Heavy workloads with RT enabled are where I would expect to see lower clocks. At any rate, we shall see soon enough. This is definitely one of the more interesting product cycles I've followed in many years.

One other thing to consider is that with tight supply at TSMC for all of their advanced nodes, this may well cost more per die than GA102. TSMC is definitely charging more per unit of die area, and likely even more per transistor, than Samsung is on their 8nm process. Since we don't have detailed yield numbers on the Samsung side it's going to be tough to know the exact cost difference, but I wouldn't be surprised if GA102 is slightly cheaper to produce - with the caveat that lower yields on Samsung's process mean a higher ratio of 'harvested' dies to fully enabled and functional ones.
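On the dies-per-wafer side at least, a rough comparison is easy - the sketch below uses the standard first-order gross-die approximation (it ignores defect density and yield entirely, and the 536mm² Navi 21 figure is only rumoured):

```python
import math

def dies_per_wafer(die_area_mm2, wafer_diameter_mm=300):
    """First-order gross-die approximation: usable wafer area over
    die area, minus an edge-loss term. Ignores defects/yield."""
    r = wafer_diameter_mm / 2
    return int(math.pi * r**2 / die_area_mm2
               - math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2))

print(dies_per_wafer(536))  # rumoured Navi 21 size -> ~103 gross dies
print(dies_per_wafer(628))  # GA102 die size       -> ~85 gross dies
```

Per-die cost then hinges entirely on the respective wafer prices, which aren't public.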
 
If all you want to do is compute fractals in FP32 like AIDA64 does, the 3080 definitely is the best.
If you have a real interest in fractals, you would want to have at least FP64 precision, and there a 7820X CPU / AVX512 or Zen 16 core / AVX2 easily beats a 3080.
 
2080Ti is 23% faster than rated while 3080 is 5% faster (both compared with boost clock). So a 14% cut-down GA102 is throttling dramatically more than a 6% cut-down TU102.

Is that sustained or over quickly? If it's not sustained, then left for longer, GA102 would get slower and slower.

In this video:


2080Ti completes the single precision test in "under 10s" (video is at 10x speed). So that's not sustained.

Clock is locked at 1980 (far above base clock) here on 3080 stress test:


Clock is variable on 2080Ti stress test, but falls only as far as boost clock:


So GA102 is not running at "power virus" clock (base) in Furmark as claimed by @DegustatoR

So the evidence right now appears to indicate that compute is throttling GA102 and that Furmark doesn't.

Furmark stress test on 5700XT never falls below "game clock" of 1905MHz:

 
A couple of other things that might be driving the increased transistor cost/HW, in addition to the rumoured cache (and of course the rumoured die sizes):

1. VRS in addition to RT, probably a few % for both. Possibly higher IPC per CU
2. Redesigned ROPs. As per MS, there is some additional Memory compression HW as well
3. ML - int8/int4 support
4. HW Decode - Likely support for 8K AV1 decode and possibly more display outputs/VR related HW
5. IO decompression - Possibly solutions similar to XSX/PS5
6. Transistors spent on optimizations for area and power/dark silicon as suggested. More fine-grained clock-gating is also likely. This can be non-trivial

All of these can add up. FWIW, I'm not so sure on N23 being only 32 CUs if the N22 being 40 CUs rumour is also true. One of them has to be wrong.
 
We know RDNA 2 is introducing raytracing hardware and promises much greater power efficiency. I think it’s fair to assume RDNA2 CUs will grow significantly just for those 2 items alone.
XSX CUs show approximately 0 growth in area compared with Navi 10 and 14. Supposedly the AMD ray tracing patent document indicates a minor increase in die size. (Maybe one day I'll bother reading it.)

Navi 21 is rumoured to support fewer hardware threads per SIMD (16 instead of 20). I can't see that this would make a major difference in CU size though. In theory, more complex shaders are likely to result in fewer hardware threads fitting into a SIMD, so supporting 20 is pointless. Fewer hardware threads in flight per SIMD also helps with L0 cache coherency (reduced thrashing there). But 16 versus 20 isn't some amazing difference... 8 would make me think for longer...
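To illustrate why the 20 → 16 cut is probably moot: resident waves per SIMD are capped by whichever runs out first, the wave slots or the vector register file. A minimal sketch using the RDNA1 figure of 1024 VGPRs per SIMD (the per-shader VGPR counts are made up for illustration):

```python
def waves_in_flight(vgprs_per_wave, wave_slots, simd_vgpr_budget=1024):
    """Waves resident on one SIMD: limited either by the hardware
    wave slots or by the vector register file, whichever is hit
    first. 1024 VGPRs per SIMD is the RDNA1 figure."""
    return min(wave_slots, simd_vgpr_budget // vgprs_per_wave)

# A shader using 64 VGPRs fills the regfile at 16 waves,
# so slots 17-20 could never be used anyway:
print(waves_in_flight(64, wave_slots=20))  # -> 16
print(waves_in_flight(64, wave_slots=16))  # -> 16
# Only very light shaders would ever notice the cut:
print(waves_in_flight(48, wave_slots=20))  # -> 20
print(waves_in_flight(48, wave_slots=16))  # -> 16
```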

If only XSX CUs were dramatically bigger...

Is it really realistic to quadruple the number of these beefier CUs on the same process while keeping power in check?
Rumours are for a 60% performance per watt improvement. That's more than normally occurs with a full node transition! It seems to indicate that Navi 10 was a terrible failure.

Can we expect Navi 23 (6500XT?) to give the same performance as 5700XT? 32 CUs, 32 ROPs, 128-bit bus, 9.8TFLOPS at 2400MHz (guessed clock). That seems pretty reasonable to me. It supports the monster-cache rumour.

How much power would that consume? 130W like 5500XT? That's a 50% bigger die than Navi 14 (235 versus 157), which is rated for 130W.
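The 9.8TFLOPS figure follows from the usual FLOPS formula; a quick check with the numbers above:

```python
cus = 32
lanes_per_cu = 64    # 2x SIMD32 per CU
flops_per_lane = 2   # an FMA counts as 2 FLOPs
clock_hz = 2.4e9     # guessed clock from above

tflops = cus * lanes_per_cu * flops_per_lane * clock_hz / 1e12
print(f"{tflops:.1f} TFLOPS")  # -> 9.8 TFLOPS
```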

If I had to guess, your area estimates are just off
My modelled
  • Navi 14 derived from Navi 10 is off by about 10mm² (167 versus 157 actual)
  • Navi 23 (with 32 WGPs) derived from Navi 10 is 13mm² too large (248 versus 235 rumoured).
Those are both 6% out. I think Navi 23 will actually be smaller than 235mm², because packaging makes the size too large. My Navi 10-based models are a few weeks old now and could be improved by my Navi 14 analysis, but 6% is crushing my enthusiasm.

(There is an error in what I published a few weeks ago, but it's a self-cancelling error in the "uncore" versus "miscellaneous IO + edges" areas - they're mutually derived, so the error cancels in other modelled GPUs. Uncore is twice as large, so miscellaneous IO + edges is half-sized. I haven't come up with a way to improve the model for these two areas, but it's notable that Navi 14 has a smaller miscellaneous IO + edges area - less HDMI/DP? Seems strange, since there's a "pro" variant... The error in these, jointly, could be 10-15mm², out of about 60-70mm² when modelled for other GPUs, I suppose, but the Navi 10 die shot is too poor in quality to understand these areas.)

My modelled PS5 die size is about 8mm² too large (316 versus 308) which is only 3%, but 308mm² is from the PS5 teardown video, so that seems murky and packaging makes it "measure" too large.

Navi 21 die size, 536mm², is too vague in my opinion to make much of a claim (modelled 524mm² with 80 WGPs). Packaging-derived measurement error would be about 20mm², e.g. 516mm².

AMD spent transistors somewhere obvious like a wider memory bus.
On-chip or >256-bit GDDR6?

An improved on-die memory system, "Infinity Fabric" combined with "Infinity Cache" should use more transistors, definitely, even without a monster cache. XSX die shows a pair of fairly large blocks that are labelled as "SOC fabric, coherency, G6 MCs" (about 25mm² for the two). I dunno how to translate that into Navi 2x... Some of that should be directly associated with the CPU, but some of it is "Infinity ...".
 
XSX CUs show approximately 0 growth in area compared with Navi 10 and 14. Supposedly the AMD ray tracing patent document indicates a minor increase in die size. (Maybe one day I'll bother reading it.)
Do we know with certainty what TSMC node the XBSX SoC is made on? Perhaps "enhanced 7nm" is TSMC's N7+, which could account for the 0% increase in CU size.
 
Do we know with certainty what TSMC node the XBSX SoC is made on? Perhaps "enhanced 7nm" is TSMC's N7+, which could account for the 0% increase in CU size.
No, we don't know for certain, but "AMD enhanced 7nm" sounds an awful lot like the tweaked N7 used in the Zen 2 refresh and Zen 3. Navi 10 is on N7P, but its transistor density is the same as N7's.
 
No, we don't know for certain, but "AMD enhanced 7nm" sounds an awful lot like the tweaked N7 used in the Zen 2 refresh and Zen 3. Navi 10 is on N7P, but its transistor density is the same as N7's.
I agree, but perhaps they were under an agreement not to disclose the actual process node until AMD does.
 
Papermaster: It is in fact the core is in the same 7nm node, meaning that the process design kit [the PDK] is the same. So if you look at the transistors, they have the same design guidelines from the fab. What happens of course in any semiconductor fabrication node is that they are able to make adjustments in the manufacturing process so that of course is what they’ve done, for yield improvements and such. For every quarter, the process variation is reduced over time. When you hear ‘minor variations’ of 7nm, that is what is being referred to.
 
No, we don't know for certain, but "AMD enhanced 7nm" sounds an awful lot like the tweaked N7 used in the Zen 2 refresh and Zen 3. Navi 10 is on N7P, but its transistor density is the same as N7's.

Yes, N7P only offers power and/or performance benefits. If what Papermaster is alluding to is true, Zen3 may not even be on N7P though I'd be surprised if it wasn't on the more advanced process by late 2020. And we have to remember that even within the same process node, density can vary widely, as we see with 7nm mobile SoCs and even to a certain extent with A100. I have not been able to ascertain with certainty which node it is using though. I would expect them to be on N7+ (EUV) for density and power reasons but I can't find any information to confirm this. For Navi 2x as well, the density advantage of N7+ would be advantageous but they could have stayed with N7P as well.
 
Yes, N7P only offers power and/or performance benefits. If what Papermaster is alluding to is true, Zen3 may not even be on N7P though I'd be surprised if it wasn't on the more advanced process by late 2020. And we have to remember that even within the same process node, density can vary widely, as we see with 7nm mobile SoCs and even to a certain extent with A100. I have not been able to ascertain with certainty which node it is using though. I would expect them to be on N7+ (EUV) for density and power reasons but I can't find any information to confirm this. For Navi 2x as well, the density advantage of N7+ would be advantageous but they could have stayed with N7P as well.
7nm mobile SoCs use different libraries compared to high performance chips, though, which explains the major difference in transistor densities.
 
Not sure about Navi, but Zen 3 may just as well be on plain 7nm - it has improved a lot since the launch of Zen 2. Older R5 3600s could barely reach 4-4.1GHz all-core clocks with high-ish voltages (circa 1.35V), while newer CPUs (20xx PGT/PGS batches) can easily do 4.4-4.5GHz at low voltages (1.25V or even less) with full stability in both LinX and Prime95. So clocks might have improved a lot even without going to N7P, N7+, N7 EUV or whatever it's called now.
 
I've been wondering why Microsoft and Sony have recently been talking about backwards-compatible games "running in GCN mode not RDNA".

USE OF WORKGROUPS IN PIXEL SHADER

A technique for executing pixel shader programs is provided. The pixel shader programs are executed in workgroups, which allows access by work-items to a local data store and also allows program synchronization at barrier points. Utilizing workgroups allows for more flexible and efficient execution than previous implementations in the pixel shader stage. Several techniques for assigning fragments to wavefronts and workgroups are also provided. The techniques differ in the degree of geometric locality of fragments within wavefronts and/or workgroups. In some techniques, a greater degree of locality is enforced, which reduces processing unit occupancy but also reduces program complexity. In other techniques, a lower degree of locality is enforced, which increases processing unit occupancy.
It seems to suggest that efficiency for pixel shading can be improved by grouping fragments into more than a single-wavefront-workgroup, e.g. a workgroup containing four wavefronts.

Quite a lot of state is shared by all fragments in a wave, so making a wave larger can be beneficial. Though there's still going to be a problem with register allocation, which is the typical reason why large workgroups have limited use in compute (along with their low latency hiding). If a workgroup is spread across some or all SIMDs in a WGP, then the register allocation problem is ameliorated.

Along the way the document implies that current hardware stores some pixel shader state (e.g. vertex attributes) in LDS. I don't know if that's actually the case, but it's interesting.

The document then proceeds to describe how the pixel shader can be broken up into stages. Any stage that performs computations that are shared by all fragments in the workgroup can be computed by a single wave, not all waves. The results are put into LDS and then all waves in the workgroup access those results.

Workgroup-orientated execution of pixel shaders would also help with cache coherency, which has been an on-going theme of RDNA.
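The staged execution the patent describes can be sketched roughly like this - Python standing in for shader pseudocode, with a dict standing in for LDS; all function names are my own illustration, not the patent's:

```python
# Sketch of the patent's staged pixel shader: one wave computes the
# values common to the whole workgroup into LDS, a barrier, then all
# waves read the shared results instead of recomputing them per wave.
WAVES_PER_WORKGROUP = 4
WAVE_SIZE = 32

def expensive_shared_setup(inputs):
    return sum(inputs)            # placeholder shared computation

def shade(frag, common):
    return frag * common          # placeholder per-fragment work

def run_workgroup(fragments, shared_inputs):
    lds = {}                      # stands in for the local data store

    # Stage 1: wave 0 alone fills LDS with per-workgroup constants.
    lds["common"] = expensive_shared_setup(shared_inputs)

    # -- a barrier() would go here: all waves wait for stage 1 --

    # Stage 2: each wave shades its own slice of the fragments.
    results = []
    for wave in range(WAVES_PER_WORKGROUP):
        for frag in fragments[wave * WAVE_SIZE:(wave + 1) * WAVE_SIZE]:
            results.append(shade(frag, lds["common"]))
    return results
```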
 
Along the way the document implies that current hardware stores some pixel shader state (e.g. vertex attributes) in LDS. I don't know if that's actually the case, but it's interesting.

The current hardware stores the vertex attributes that are used for interpolation in LDS during a pixel shader invocation (for up to 16 triangles per PS wave; you interpolate them with V_INTERP_P1_F32 and V_INTERP_P2_F32 instructions in the shader, which read directly from LDS).
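For anyone curious, those two instructions implement a barycentric evaluate split into two FMAs (V_INTERP_P1_F32: D = P10 * i + P0, then V_INTERP_P2_F32: D = P20 * j + D, where P0/P10/P20 are the per-attribute values read from LDS). A Python sketch of the semantics with a worked example:

```python
def v_interp_p1_f32(p0, p10, i):
    # first half of the interpolation: D = P10 * i + P0
    return p10 * i + p0

def v_interp_p2_f32(d, p20, j):
    # second half: D = P20 * j + D
    return p20 * j + d

# Attribute values 1.0, 4.0, 7.0 at the three vertices, stored as
# P0 = v0, P10 = v1 - v0, P20 = v2 - v0; barycentrics i=0.5, j=0.25:
p0, p10, p20 = 1.0, 4.0 - 1.0, 7.0 - 1.0
d = v_interp_p1_f32(p0, p10, i=0.5)
d = v_interp_p2_f32(d, p20, j=0.25)
print(d)  # -> 4.0
```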
 