AMD: Navi Speculation, Rumours and Discussion [2019-2020]

Higher IPC relates to efficiency in one way: if the chip can do more instructions per clock, and thus more performance, it can be clocked lower and consume less power.
Applying "IPC" to GPUs is whack, but whatever.

Also, no uArch improvement is ever free; they all cost power. And of course you have to clock your designs as high as you can, since N7 isn't cheap.
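
For what the first-order physics is worth, here's the toy math behind "higher IPC lets you clock lower". A minimal sketch, assuming dynamic power scales with f·V² and that voltage tracks frequency roughly linearly in the operating range; the 10% IPC figure is made up for illustration:

```python
# Toy model: dynamic power ~ C * f * V^2. Assume V scales roughly
# linearly with f within the operating range (an illustrative
# assumption, not a measured V/f curve for any real chip).
def relative_power(f_rel):
    v_rel = f_rel               # assumed V ~ f
    return f_rel * v_rel ** 2   # P ~ f * V^2, so ~ f^3 under this model

# Hypothetical: +10% IPC reaches the same performance at ~0.91x clock.
f_new = 1 / 1.10
print(f"clock: {f_new:.2f}x, power: {relative_power(f_new):.2f}x")
# -> clock: 0.91x, power: 0.75x at the same performance
```

Of course that cuts both ways: spend the uArch gain on clocks instead and the power cost is just as superlinear.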
 
Just something to add: on the slide where AMD presented Vega 20, you can see that they achieved 50% lower power at the same frequency purely from the transition to 7nm, and that's the "good old" GCN chip, right?

https://images.anandtech.com/doci/13923/next_horizon_david_wang_presentation-06.png

At Computex 2018 they presented a slide where 7nm shows 2x power efficiency:

https://fudzilla.com/images/stories/2018/June/davidWang.jpg

Now back to Navi. It's different. It's RDNA, right? AMD said the 1.5x perf/W is a combination of three things: process, design capability and architecture. Putting them together should IMO bring more to the table than a simple shrink of GCN from 14nm to 7nm.
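
These perf/W factors compose multiplicatively, so the claims can be sanity-checked with trivial arithmetic. A sketch with made-up numbers; the 1.3x process figure below is my own assumption, not an AMD number:

```python
# 50% lower power at the same frequency/performance doubles perf/W:
print(1 / 0.5)    # 2.0x efficiency, matching the Vega 20 slide

# AMD's 1.5x perf/W claim for Navi = process * design * architecture.
# If the process alone were worth, say, 1.3x (assumed, not official),
# the rest would have to come from design + architecture:
print(1.5 / 1.3)  # ~1.15x left over for design + architecture
```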
 
At Computex 2018 they presented a slide where 7nm shows 2x power efficiency:

https://fudzilla.com/images/stories/2018/June/davidWang.jpg
It's related to the process, not to a product. That means 2x power efficiency for the same chip design running at the same clocks, usually quite low clocks typical of small mobile chips. Those numbers came from TSMC. AMD later in 2018 stated that the real numbers are lower than expected (about a 40% power reduction for the same design at the same clocks). Navi is a new design and will definitely not run at the same clocks as 14nm Vega.
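
Note those two figures aren't in the same units: converting the 40% power reduction into an efficiency multiplier (same design/clocks assumption) shows how far it falls short of the marketing number:

```python
# TSMC marketing: 2x power efficiency (half the power at iso-performance).
# AMD's later figure: ~40% power reduction for the same design and clocks.
power_reduction = 0.40
print(1 / (1 - power_reduction))  # ~1.67x perf/W, short of the 2x claim
```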
 
Either way, what's with the total lack of any relevant Navi leaks?
Was the RDMA acronym ever actually leaked?
There's been a modest stream of LLVM changes coming out, and a few curious benchmark database entries.
I think the acronym is RDNA. At least from the code changes, I haven't run into any mention of RDNA despite many GCN references and shared flags with GCN GPUs, including many that have GFX10 lining up with older GCN architectures.

Perhaps the lack of mention is for secrecy purposes, or the RDNA label simply isn't used by the staff responsible for supporting the architectures, whether because it wasn't communicated to them or because they don't use it.

Haven’t had a chance to fully dig into the links being made here, but they suggest RDNA is indeed Super-SIMD as conceived of in the patent.
I'm not 100% certain I'm reading the autotranslated text totally right, and I'm not ruling out a possible change like dual-issue or a difference in issue latency.
However, based on my (non-authoritative) interpretation and what I think is being said, I think the changes are being misconstrued.

There is a potential difference in how workgroups allocate their wavefronts, with it being possible in one mode for a workgroup to have wavefronts on more than one CU. There are implications for barriers, which only had to be supported within a CU when workgroups were limited to one CU each. The memory comments seem concerned about visibility/ordering of workgroup memory accesses in the event that wavefronts are no longer reading/writing to one CU's local cache. This seems of higher importance given that the code those changes were made to deals with synchronization and writes to possibly shared global memory.
There may be something new about this L0, as there is a new bit for coherence, more active discussion of invalidations versus the write-through GCN L1, and a new memory counter. What specifically the L0 is versus the L1 in prior generations isn't clear, as the vector memory path in current GCN does order accesses within a wavefront at least.
I don't think it's the same as the patents' register cache, which is local to a SIMD/cluster and on the wrong end of the memory pipeline to be of any concern for other CUs or wavefronts.
There is a single reference about a register destination cache in https://github.com/llvm-mirror/llvm...3380939#diff-ad4812397731e1d4ff6992207b4d38fa, which is a different file with a different purpose.
There's some discussion of code comments for the buggy prefetch instruction, and some discussion of the size of either the vector cache or L0, both of which I think may be red herrings. For one thing, the prefetch and I$ comments are dealing with instruction fetch, which is not subject to synchronization operations or atomic writes; ordering concerns between CUs for static code seem unnecessary. Claims that the size of the destination cache somehow matches a workgroup don't seem supported in what I have read, and I think they wouldn't be consistent with the patents; I may be misreading the translation, though.
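
To make the visibility concern concrete, here's a toy (deliberately oversimplified) model of why a workgroup spanning two CUs with private caches needs invalidation or similar machinery at synchronization points. All names and the cache behavior are made up for illustration, not taken from the code changes:

```python
# Toy model: two CU-private caches in front of shared memory. A wavefront
# on CU0 publishes a flag; a wavefront on CU1 already has the stale line.
shared_mem = {"flag": 0}

class CuCache:
    def __init__(self):
        self.lines = {}

    def read(self, addr):
        if addr not in self.lines:       # miss: fill from shared memory
            self.lines[addr] = shared_mem[addr]
        return self.lines[addr]          # hit: may return a stale value

    def write(self, addr, val, write_through=False):
        self.lines[addr] = val
        if write_through:                # GCN-style write-through behavior
            shared_mem[addr] = val

    def invalidate(self):
        self.lines.clear()

cu0, cu1 = CuCache(), CuCache()
cu1.read("flag")                         # CU1 now caches flag == 0
cu0.write("flag", 1, write_through=True)
print(cu1.read("flag"))                  # 0 -- stale hit on the old line
cu1.invalidate()                         # what a barrier would have to do
print(cu1.read("flag"))                  # 1 -- visible after invalidation
```

When a workgroup fits in one CU, none of this arises: every wavefront hits the same cache, so a barrier never has to reason about another CU's lines.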
 
As always, thanks for the in-depth insight. Is there anything apparent to you that would cause CU growth? It sounds as if we're getting 40-48 CUs in 250mm^2 on 7nm, which certainly sounds bigger than Polaris CUs adjusted for scaling, as well as Vega CUs, going by the Radeon VII's 64 CUs in 330mm^2.
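
Back-of-envelope on those figures (crudely dividing whole-die area by CU count, so uncore, PHYs and ROPs inflate every number; upper bounds only, and the Navi figures are rumours):

```python
# Whole-die mm^2 per CU -- uncore included, so upper bounds, not CU sizes.
for name, area_mm2, cus in [("Navi @ 40 CUs (rumour)", 250, 40),
                            ("Navi @ 48 CUs (rumour)", 250, 48),
                            ("Vega 20 / Radeon VII", 330, 64)]:
    print(f"{name}: {area_mm2 / cus:.1f} mm^2 per CU")
# -> 6.2, 5.2 and 5.2 mm^2 per CU respectively
```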
 
Polaris has 32 ROPs, Navi (probably?) 64. That would mean a significant difference in die size. Btw, a 256-bit GDDR6 controller is definitely bigger than a 256-bit GDDR5 controller, and a 256-bit GDDR6 PHY should be bigger than the HBM2 PHY on Vega.
 
Plus there are several rumors pointing to Navi doubling on the front-end by getting 8 shader engines, each serving 5 CUs. In Polaris each SE serves up to 9 CUs, and in Vega 10 each SE serves up to 16 CUs.
This also means twice the number of geometry processors compared to Vega 20, despite the lower CU count.
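
The rumoured front-end split as plain arithmetic (the Navi row is the rumour, not a confirmed configuration):

```python
# (shader engines, total CUs) per chip
configs = {
    "Polaris 10":    (4, 36),
    "Vega 10":       (4, 64),
    "Vega 20":       (4, 64),
    "Navi (rumour)": (8, 40),
}
for name, (se, cus) in configs.items():
    print(f"{name}: {se} SEs, {cus / se:.0f} CUs per SE")
# Navi would pair twice the geometry front-end with far fewer CUs per SE.
```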
 
Is there anything apparent to you that would cause CU growth?

There are signs of changes, though the devil would be in the details.
Register bank conflicts and a register destination cache point to a potentially more complicated operand network. There's the storage and wiring for the register cache, which might be several KB per SIMD and have more ports than the main file. The main file may have more arbitration and logic attached to it. Depending on how instruction issue has changed for the SIMDs, there may be more queues and incrementally more hardware for elements like the removed data hazards.
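
For scale, this is the kind of back-of-envelope behind "several KB per SIMD"; the entry count and widths are my guesses for illustration, not figures from the patents or the code:

```python
# Hypothetical register destination cache sizing (all numbers assumed).
entries_per_simd = 32   # guessed number of cached destination slots
lanes = 64              # GCN-style wavefront width
bytes_per_reg = 4       # one 32-bit VGPR element per lane
size_kb = entries_per_simd * lanes * bytes_per_reg / 1024
print(f"~{size_kb:.0f} KB per SIMD")  # 8 KB with these guesses
```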
The SIMD blocks themselves are not the only space consumers, as code changes and AMD's announcements include some sort of cache/memory update. The classic GCN L1/L2 hierarchy is somewhat lightweight in terms of what it does, so any notable enhancement is likely to cost area. The load/store and texturing blocks are significant area consumers that could expand with some memory changes, and there's a new addressing type for image instructions that could add some complexity. If there's more interconnect or cache levels there's probably extra area beyond the L1 blocks and L2.
There are some additional scalar memory operations that might indicate a somewhat improved scalar unit.

There are rumors of shader engine changes, including a new scheduling mode that might point to more complex logic coordinating the CUs. Perhaps other blocks like decode/encode blocks or controllers like HBCC are present.

For a 7nm GDDR6 GPU, the memory type would be significantly less compact than HBM, and the fabric linking clients likely has more clients and distance to traverse. That fabric in Vega was generally bulkier for what it did.
Various pipeline changes or extra buffering for higher clock speeds also add logic and area, as Vega apparently did for much of its additional transistor budget.
 
Thanks. I know Rambus claimed GDDR6 controllers were quite a bit larger than HBM's, so even a 4096-bit HBM2 interface may be smaller than a 256-bit GDDR6 interface.

Edit: actually, comparing to the Radeon VII, it should be smaller on Navi. A 128-bit GDDR6 PHY is 1.5-1.75x larger than a single-stack HBM PHY, but the Radeon VII uses four stacks.
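
Rough arithmetic on the Rambus figure, taking one HBM stack PHY as the unit of area and the 1.5-1.75x range at face value:

```python
# Relative PHY area, with one HBM stack PHY = 1.0 (per the Rambus claim
# that a 128-bit GDDR6 PHY is ~1.5-1.75x a single-stack HBM PHY).
hbm_stack_phy = 1.0
gddr6_128b_lo, gddr6_128b_hi = 1.5, 1.75

radeon_vii_phy = 4 * hbm_stack_phy   # quad-stack HBM2 (4096-bit)
navi_phy_lo = 2 * gddr6_128b_lo      # 256-bit GDDR6 = 2 x 128-bit
navi_phy_hi = 2 * gddr6_128b_hi
print(radeon_vii_phy, navi_phy_lo, navi_phy_hi)  # 4.0 vs 3.0-3.5
# So 256-bit GDDR6 could indeed come in under quad-stack HBM2.
```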

 
I take seriously AdoredTV's multiple sources all saying that Navi is consuming too much power. A "50% gain in power efficiency" over any 14nm GPU is frankly rubbish when there's a new, "more efficient" architecture involved.
AMD claimed there was a 1.25X performance advantage at the same power going 14nm -> 7nm. So that means there's actually zero perf/Watt advantage going from Vega to Navi. It's all process savings.
 

Yes, and it's probably the reason why they took the Radeon VII out of the equation and compared Navi with an unspecified 14nm/12nm product...
 
AMD claimed there was a 1.25X performance advantage at the same power going 14nm -> 7nm.
Such a comparison is only valid for the same design at the same clocks. Do you expect Navi to be clocked the same as 14nm Vega?

So that means there’s actually zero perf/Watt advantage going from Vega to Navi.
How would you explain that the expected performance of the top Navi model is about 10% below Vega 20, while power consumption is ~33% lower?
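
Plugging those rumoured numbers in makes the point; both inputs are rumours, so treat the output accordingly:

```python
perf_rel = 0.90    # ~10% below Vega 20 (rumour)
power_rel = 0.67   # ~33% lower power (rumour)
print(f"{perf_rel / power_rel:.2f}x perf/W vs Vega 20")  # ~1.34x
# Vega 20 is already 7nm, so a gain this size can't be pure process.
```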
 