AMD: Navi Speculation, Rumours and Discussion [2019-2020]

DavidGraham · May 29, 2019

no-X said:
How would you explain that the expected performance of the top Navi model is about 10% under Vega 20, but power consumption is ~33% lower?
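As a rough sanity check on the figures in the question, relative perf/W is just the performance ratio divided by the power ratio. The sketch below assumes "~33 % lower" means the part draws about 0.67x the power; the function name is made up for illustration and the inputs are the poster's estimates, not measurements.

```python
def relative_perf_per_watt(perf_ratio: float, power_ratio: float) -> float:
    """Perf/W of part A relative to part B, given A/B performance and power ratios."""
    return perf_ratio / power_ratio


# Estimates from the question: ~10% slower than Vega 20 at ~33% lower power.
# Both numbers are forum estimates, not benchmark results.
navi_vs_vega20 = relative_perf_per_watt(perf_ratio=0.90, power_ratio=0.67)
print(f"{navi_vs_vega20:.2f}x")  # prints 1.34x
```

So if both estimates held, the implied perf/W gain over Vega 20 would be roughly a third.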

It's not going to be just 10% under Vega 20. AMD showed their absolute best result with their weird Strange Brigade choice (which is not even a popular AA game); expect much worse real-world results than this.

no-X · May 29, 2019

AMD has some marketing deal with Rebellion. That's a known thing. Your opinion that Strange Brigade showed the absolute best result is just speculation.

Gubbi · May 29, 2019

3dilettante said:
There are signs of changes, though the devil would be in the details.
Having register bank conflicts and a register destination cache point to a potentially more complicated operand network. There's the storage and wiring for the register cache, which might be several KB per SIMD and more ports than the main file. The main file may have more arbitration and logic attached to it. Depending on how instruction issue has changed for the SIMDs, there may be more queues and incrementally more hardware for elements like the removed data hazards.

The L0 operand/result cache would lower utilization of the SRAM register file, allowing more ALUs to be served by the same register file, or at the very least save a lot of power. It would be interesting to know if conflicts are handled in hardware or in software. If it is software, scheduling would be isolated to a single shader program; if it is in hardware, you could have independent shader programs running on the same CU. Maybe the super SIMD extension is a means to utilize spare register file bandwidth (because requests/stores are served by the L0 operand cache).

Cheers

snc · May 29, 2019

anexanhume said:
AMD claimed there was a 1.25X performance advantage at the same power 14nm-> 7nm. So that means there’s actually zero perf/Watt advantage going from Vega to Navi. It’s all process savings.

What? "1.25x improvement per clock, and 1.5x improvement per watt"

DavidGraham · May 29, 2019

no-X said:
Your opinion that Strange Brigade showed the absolute best result is just speculation.

It really is not. AMD GPUs are known to punch above their weight in this title; the fact that they showed only this title in comparison to the 2070 speaks volumes about their real-world performance. They did the same with Radeon VII by the way, showing it beating the 2080 in this title by a big margin, and we all know how that turned out in the end.

Bondrewd · May 29, 2019

DavidGraham said:
They did the same with Radeon VII by the way, showing it beating the 2080 in this title by a big margin, and we all know how that turned out in the end.

2080 currently beats client Vega20 in it by ~5% median so no, it wasn't the best case, not even close.

Please stop with this bullshit.

Jawed · May 29, 2019

Gubbi said:
The L0 operand/result cache would lower utilization of the SRAM register file, allowing more ALUs to be served by the same register file, or at the very least save a lot of power. It would be interesting to know if conflicts are handled in hardware or in software. If it is software, scheduling would be isolated to a single shader program; if it is in hardware, you could have independent shader programs running on the same CU. Maybe the super SIMD extension is a means to utilize spare register file bandwidth (because requests/stores are served by the L0 operand cache).

Another wild arsed guess: there will not be a three-operand fetch from the register file for any instruction. A maximum of two operands will come from the register file and the other has to come from the operand cache (e.g. for FMA).

It's basically pointless in this modern era to make the register file support three-operand fetches when there are so few instructions that can use three operands.

Additionally, operands that go to cache but never need to go to the register file (having a short lifetime, e.g. one or two cycles) save write power/bandwidth to VGPRs.

Since VGPRs consume quite a lot of space per CU, power will be relatively high to perform reads/writes (power is a function of distance), so less data making these round trips will save power. Reducing addressing bandwidth and having a loosened banking configuration will also make the VGPRs consume less power.

PSman1700 · May 29, 2019

Bondrewd said:
2080 currently beats client Vega20 in it by ~5% median so no, it wasn't the best case, not even close.

Please stop with this bullshit.

It is not BS; it's logical for a company to use the best-case scenario when demonstrating a product for the first time. I think we have to wait for real-world game tests to see if it can match an RTX 2070, let alone a 2080 or higher.

https://www.pcgamer.com/amds-latest-gpu-driver-aims-to-boost-performance-in-strange-brigade/

anexanhume · May 29, 2019

snc said:
What? "1.25x improvement per clock, and 1.5x improvement per watt"

When they launched Radeon 7, they claimed 1.25X performance iso power.

Bondrewd · May 29, 2019

PSman1700 said:
It is not BS; it's logical for a company to use the best-case scenario when demonstrating a product for the first time

It is BS because it's no longer the best case scenario ever since that nV driver update.

Turing is super-competitive there now, and they did compare the unnamed Navi against the 2070.

PSman1700 · May 29, 2019

AMD not using a benchmark that favours them is hard to believe. Anyway, better to wait for real-world game performance tests and reviews. Usually AMD's and Nvidia's own benchmarks are what they are: BS.

DavidGraham · May 30, 2019

Bondrewd said:
2080 currently beats client Vega20 in it by ~5% median so no, it wasn't the best case, not even close.

Only in DX12. AMD used the Vulkan API for their Vega 7 vs 2080 comparison, in which they still have a very large lead. So they DO insist on using their absolute best-case scenario.

Bondrewd said:
Please stop with this bullshit.

Please use a more civilized manner of conversation or I will be forced to retaliate harshly.

3dcgi · May 30, 2019

CarstenS said:
I was told by an AMD rep that there is a reason other than marketing that salvage parts have CUs disabled according to the number of SEs. Nothing more specific though, and this info is a couple of years old, so it could be obsolete with Vega already.

There are sync points, like maintaining raster order, so if one SE runs ahead it won't get too far.

Bondrewd said:
Applying "IPC" to GPUs is whack, but whatever.

IPC means better performance per clock given the same configuration.

itsmydamnation · May 30, 2019

DavidGraham said:
Only in DX12. AMD used the Vulkan API for their Vega 7 vs 2080 comparison, in which they still have a very large lead. So they DO insist on using their absolute best-case scenario.

You know this how?

w0lfram · May 30, 2019

Jawed said:
Another wild arsed guess: there will not be a three-operand fetch from the register file for any instruction. A maximum of two operands will come from the register file and the other has to come from the operand cache (e.g. for FMA).

It's basically pointless in this modern era to make the register file support three-operand fetches when there are so few instructions that can use three operands.

Additionally, operands that go to cache but never need to go to the register file (having a short lifetime, e.g. one or two cycles) save write power/bandwidth to VGPRs.

Since VGPRs consume quite a lot of space per CU, power will be relatively high to perform reads/writes (power is a function of distance), so less data making these round trips will save power. Reducing addressing bandwidth and having a loosened banking configuration will also make the VGPRs consume less power.

What if... the L0 operand/result cache is connected directly to the HBCC …?

3dilettante · May 30, 2019

anexanhume said:
Thanks, I know that Rambus claimed GDDR6 controllers were quite a bit larger than HBM, so even a 4096 bit HBM2 interface may be smaller than a 256-bit GDDR6 interface.

Edit: actually comparing to Vega VII, it should be smaller on Navi. 128 bit GDDR6 is 1.5-1.75X larger than single stack HBM, but Vega VII is quad stack.

Smaller GPUs usually have a larger proportion of their area taken up by controllers and hardware outside of the primary graphics area, so even if smaller in absolute terms, a 256-bit GDDR6 setup can have more of an impact than a quad-HBM2 Vega 7, depending on other factors.
However, one of the other factors is that Vega 7 has an unusually large amount of area around the graphics core, which seems to contribute to the area bloat versus an ideal shrink from 14nm.
One of the possible residents on the die is additional infinity fabric blocks to connect the other HBM controllers, and possibly more mesh connecting the two sides--taking up non-zero area. GDDR6 likely takes up 3 or possibly 3.x sides, and may require a more sprawling interconnect.

Gubbi said:
The L0 operand/result cache would lower utilization of the SRAM register file, allowing more ALUs to be served by the same register file, or at the very least save a lot of power. It would be interesting to know if conflicts are handled in hardware or in software. If it is software, scheduling would be isolated to a single shader program; if it is in hardware, you could have independent shader programs running on the same CU. Maybe the super SIMD extension is a means to utilize spare register file bandwidth (because requests/stores are served by the L0 operand cache).

Cheers

There are comments about bank conflicts that seem to indicate a best-effort attempt by the CU to gather operands, and if the conflicts are significant enough there will be stalls. There's no indication of any encoding changes for the instructions to indicate that software has any other means of handling conflicts besides paying attention to the register IDs that belong to the same banks.
https://github.com/llvm-mirror/llvm...b3561b2#diff-1fe939c9865241da3fd17c066e6e0d94

(from GCNRegBankReassign.cpp -- note that it's not called RDNARegBankReassign)

/// On GFX10 registers are organized in banks. VGPRs have 4 banks assigned in
/// a round-robin fashion: v0, v4, v8... belong to bank 0. v1, v5, v9... to
/// bank 1, etc. SGPRs have 8 banks and allocated in pairs, so that s0:s1,
/// s16:s17, s32:s33 are at bank 0. s2:s3, s18:s19, s34:s35 are at bank 1 etc.
///
/// The shader can read one dword from each of these banks once per cycle.
/// If an instruction has to read more register operands from the same bank
/// an additional cycle is needed. HW attempts to pre-load registers through
/// input operand gathering, but a stall cycle may occur if that fails. For
/// example V_FMA_F32 V111 = V0 + V4 * V8 will need 3 cycles to read operands,
/// potentially incuring 2 stall cycles.
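
The bank mapping in that comment can be sketched in a few lines of Python. This is only a toy model of what the comment describes (4 VGPR banks assigned round-robin, 8 SGPR banks allocated in pairs, one dword per bank per cycle); the function names are made up for illustration, and it ignores whatever the hardware's operand gathering manages to hide. Treating repeated reads of the same register as a single read is an assumption on my part.

```python
from collections import Counter


def vgpr_bank(reg: int, num_banks: int = 4) -> int:
    """Bank of VGPR v<reg>: round-robin, so v0, v4, v8... land in bank 0."""
    return reg % num_banks


def sgpr_bank(reg: int, num_banks: int = 8) -> int:
    """Bank of SGPR s<reg>: allocated in pairs, so s0:s1 and s16:s17 land in bank 0."""
    return (reg // 2) % num_banks


def vgpr_read_cycles(source_regs) -> int:
    """Worst-case cycles to read the VGPR sources of one instruction.

    One dword per bank per cycle, so the busiest bank sets the cycle count.
    Assumption: the same register used twice is counted as one read.
    """
    per_bank = Counter(vgpr_bank(r) for r in set(source_regs))
    return max(per_bank.values(), default=0)


# The comment's example: V_FMA_F32 reading v0, v4, v8 (all bank 0).
conflicting = vgpr_read_cycles([0, 4, 8])
# Same instruction with sources spread over banks 0, 1 and 2.
spread = vgpr_read_cycles([0, 5, 10])
print(conflicting, spread)  # prints 3 1
```

In this model the v0/v4/v8 case needs 3 read cycles (up to 2 stalls, matching the comment), while v0/v5/v10 would read in one, which is the kind of assignment a bank-reassignment pass would presumably aim for.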

As for L0, that term has been used in more than one context. When discussing a destination cache in register file patents, AMD described the output flops of the register file as serving as an L0 for repeated accesses to the same ID, rather than describing the register output cache.
The L0 in the LLVM changes appears to be a CU-local memory pool that plays a role in memory access ordering and can impact data visibility to wavefronts in other CUs, which seems distinct from the question of augmenting the register file and result forwarding within a SIMD.

Jawed said:
Another wild arsed guess: there will not be a three-operand fetch from the register file for any instruction. A maximum of two operands will come from the register file and the other has to come from the operand cache (e.g. for FMA).

It's basically pointless in this modern era to make the register file support three-operand fetches when there are so few instructions that can use three operands.

Additionally, operands that go to cache but never need to go to the register file (having a short lifetime, e.g. one or two cycles) save write power/bandwidth to VGPRs.

There seems to be an implication from the above code comments that there are 4 banks of vector register file, and unlike prior GCN architectures it's not a foregone conclusion that an instruction has guaranteed access to them. Going by the description of the stall behavior, it's possible that an FMA could source 3 in the same cycle with the appropriate register allocation pattern.
A significant motivation for the super-SIMD patent is to use the lost operand access cycles and this could entail dual-issue or faster issue latency. The odd way the GFX10 changes document latencies may be consistent with something along those lines.

w0lfram said:
What if... the L0 operand/result cache is connected directly to the HBCC …?

I'm not clear on the full purpose of it, but it's called an L0 and there's mention of an L1 as well. Possibly, there's an L2 or something else beyond. The HBCC in Vega is past all the cache layers, and since its job is paging resources into the local VRAM pool it's not specced to handle something like all the local cache output of the CUs.

DavidGraham · May 30, 2019

itsmydamnation said:
You know this how?

snc · May 30, 2019

anexanhume said:
When they launched Radeon 7, they claimed 1.25X performance iso power.

And now they claim 1.5x

del42sa · May 30, 2019

snc said:
And now they claim 1.5x

No, they claimed exactly the same with Vega: 1.25x performance and 1.5x perf/watt.

https://images.anandtech.com/doci/13923/next_horizon_david_wang_presentation-06.png

no-X · May 30, 2019

DavidGraham said:
It really is not, AMD GPUs are known to punch above their weight in this title, the fact they showed this only title in comparison to 2070 speaks volumes about their real world performance. They did the same with Radeon VII by the way, showing it beating the 2080 in this title by a big margin, and we all know how that turned out in the end.

So it's not only speculation, it's speculation based on some doubtful "facts":
1. AMD presented RVII performance in three games representing three APIs (DX11, DX12, Vulkan). They didn't cherry-pick only "their absolute best result". Not even the Strange Brigade performance was "their absolute best result" at that time. E.g. in Call of Duty: Black Ops 4, RVII was 20 % faster than RTX 2080.
2. Nvidia released drivers which boosted performance in Strange Brigade, so the situation changed.
3. Why should we expect that a game which was fine with the Vega architecture will be the best choice for the Navi architecture? There are several games which like Polaris but don't like Fiji.
4. Strange Brigade was never "their absolute best result" and is not even today. E.g. Radeon VII in World War Z @2560×1440 performs 28 % faster than RTX 2080. That would probably be "their absolute best result" to present Navi if(!) the architecture prefers the same workloads.
5. It's not likely that Navi prefers the same type of workloads as Vega, because the compute/fillrate and compute/geometry ratios were completely changed. It will have significantly lower compute performance, but significantly higher geometry performance than Vega.

So your findings are really just speculation, and not even speculation based on facts. It's speculation based on other speculation.