Thanks, I know that Rambus claimed GDDR6 controllers were quite a bit larger than HBM, so even a 4096 bit HBM2 interface may be smaller than a 256-bit GDDR6 interface.
Edit: actually comparing to Vega VII, it should be smaller on Navi. 128 bit GDDR6 is 1.5-1.75X larger than single stack HBM, but Vega VII is quad stack.
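To put rough numbers on that comparison, here is a back-of-the-envelope sketch. It assumes PHY/controller area scales linearly with GDDR6 bus width and that every HBM stack interface costs the same area; both are my simplifications, not something Rambus stated. The function name and the "HBM stack" area unit are made up for illustration.

```python
# Area is expressed in units of "one HBM stack interface".
# The 1.5-1.75x range is the figure quoted above for 128-bit GDDR6.
GDDR6_128BIT_VS_HBM_STACK = (1.5, 1.75)


def gddr6_area_in_hbm_stacks(bus_width_bits):
    """Estimate GDDR6 interface area as a (low, high) range in HBM-stack units.

    Assumption: area scales linearly with bus width.
    """
    scale = bus_width_bits / 128
    low, high = GDDR6_128BIT_VS_HBM_STACK
    return low * scale, high * scale


navi_low, navi_high = gddr6_area_in_hbm_stacks(256)
vega_quad_stack = 4.0  # four HBM2 stacks, one unit each (assumed equal)

print(f"256-bit GDDR6: {navi_low}-{navi_high} stack units vs {vega_quad_stack} for quad-stack HBM2")
print(f"ratio: {navi_low / vega_quad_stack}-{navi_high / vega_quad_stack}")
```

Under those assumptions a 256-bit GDDR6 interface lands at roughly three quarters to seven eighths of the quad-stack HBM2 area, so smaller in absolute terms but not by a large margin.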
Smaller GPUs usually have a larger proportion of their area taken up by controllers and hardware outside of the primary graphics area, so even if smaller on absolute terms, a 256-bit GDDR6 setup can have more of an impact than a quad-HBM2 Vega 7, depending on other factors.
However, one of the other factors is that Vega 7 has an unusually large amount of area around the graphics core, which seems to contribute to the area bloat versus an ideal shrink from 14nm.
One possible resident on the die is additional Infinity Fabric blocks to connect the other HBM controllers, and possibly more mesh connecting the two sides, all of which takes up non-zero area. GDDR6 likely takes up 3 or possibly 3.x sides of the die, and may require a more sprawling interconnect.
The L0 operand/result cache would lower utilization of the SRAM register file, allowing more ALUs to be served by the same register file, or at the very least save a lot of power. It would be interesting to know if conflicts are handled in hardware or in software. If it is in software, scheduling would be isolated to a single shader program; if it is in hardware, you could have independent shader programs running on the same CU. Maybe the super-SIMD extension is a means to utilize spare register file bandwidth (because requests/stores are served by the L0 operand cache).
Cheers
There are comments about bank conflicts that seem to indicate a best-effort attempt by the CU at gathering operands, and if the conflicts are significant enough there will be stalls. There's no indication of any encoding changes for the instructions to indicate that software has any other means of handling conflicts besides paying attention to the register IDs that belong to the same banks.
https://github.com/llvm-mirror/llvm...b3561b2#diff-1fe939c9865241da3fd17c066e6e0d94
(from GCNRegBankReassign.cpp -- note that it's not called RDNARegBankReassign)
/// On GFX10 registers are organized in banks. VGPRs have 4 banks assigned in
/// a round-robin fashion: v0, v4, v8... belong to bank 0. v1, v5, v9... to
/// bank 1, etc. SGPRs have 8 banks and allocated in pairs, so that s0:s1,
/// s16:s17, s32:s33 are at bank 0. s2:s3, s18:s19, s34:s35 are at bank 1 etc.
///
/// The shader can read one dword from each of these banks once per cycle.
/// If an instruction has to read more register operands from the same bank
/// an additional cycle is needed. HW attempts to pre-load registers through
/// input operand gathering, but a stall cycle may occur if that fails. For
/// example V_FMA_F32 V111 = V0 + V4 * V8 will need 3 cycles to read operands,
/// potentially incuring 2 stall cycles.
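The banking rule in that comment can be sketched in a few lines. This is only my reading of the comment, not code from LLVM: the function names are hypothetical, the model ignores the operand gathering the hardware attempts (so it gives the worst case), and it assumes a register that appears twice in one instruction is read only once.

```python
from collections import Counter

NUM_VGPR_BANKS = 4  # v0, v4, v8... -> bank 0; v1, v5, v9... -> bank 1
NUM_SGPR_BANKS = 8  # allocated in pairs: s0:s1 -> bank 0, s2:s3 -> bank 1


def vgpr_bank(reg):
    """Bank of VGPR v<reg> under round-robin assignment."""
    return reg % NUM_VGPR_BANKS


def sgpr_bank(reg):
    """Bank of SGPR s<reg>; pairs share a bank and wrap every 16 registers."""
    return (reg // 2) % NUM_SGPR_BANKS


def vgpr_read_cycles(src_regs):
    """Worst-case cycles to read the VGPR sources of one instruction.

    One dword per bank per cycle, so the busiest bank sets the count.
    Assumption: a repeated register is read once (hence the set()).
    """
    if not src_regs:
        return 0
    per_bank = Counter(vgpr_bank(r) for r in set(src_regs))
    return max(per_bank.values())


# The comment's example: V_FMA_F32 reading v0, v4, v8 (all bank 0).
cycles = vgpr_read_cycles([0, 4, 8])
print(cycles, "read cycles,", cycles - 1, "potential stall cycles")
```

With sources v0, v1, v2 the same function returns a single cycle, which is the allocation pattern the reassignment pass is presumably trying to reach.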
As for L0, that term has been used in more than one context. In register file patents that discuss a destination cache, AMD described the output flops of the register file as serving as an L0 for repeated accesses to the same ID, rather than applying the term to the register output cache.
The L0 in the LLVM changes appears to be a CU-local memory pool that plays a role in memory access ordering and can impact data visibility to wavefronts in other CUs, which seems distinct from the question of augmenting the register file and result forwarding within a SIMD.
Another wild arsed guess: there will not be a three-operand fetch from the register file for any instruction. A maximum of two operands will come from the register file and the other has to come from the operand cache (e.g. for FMA).
It's basically pointless in this modern era to make the register file support three-operand fetches when there are so few instructions that can use three operands.
Additionally, operands that go to cache but never need to go to the register file (having a short lifetime, e.g. one or two cycles) save write power/bandwidth to VGPRs.
There seems to be an implication from the above code comments that there are 4 banks of vector register file, and unlike prior GCN architectures it's not a foregone conclusion that an instruction has guaranteed access to them. Going by the description of the stall behavior, it's possible that an FMA could source all 3 operands in the same cycle with the appropriate register allocation pattern.
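As a toy illustration of what "the appropriate register allocation pattern" could mean, here is a greedy sketch that moves conflicting sources into free registers in unused banks. It is not how GCNRegBankReassign.cpp works internally; the function name is hypothetical, the 4-bank round-robin layout comes from the quoted comment, and it ignores the fact that a real pass must rename the register across its whole live range.

```python
NUM_VGPR_BANKS = 4  # round-robin banks, per the quoted GFX10 comment


def reassign_conflict_free(src_regs, free_regs):
    """Return source registers with bank conflicts greedily removed.

    The first operand seen in each bank stays put; later operands in an
    already-used bank move to the first free register in an unused bank.
    If no such register exists, the conflict is left in place.
    """
    used_banks = set()
    mapping = {}  # original register -> register actually used
    free = list(free_regs)
    result = []
    for reg in src_regs:
        if reg in mapping:  # same register repeated: reuse its assignment
            result.append(mapping[reg])
            continue
        chosen = reg
        if reg % NUM_VGPR_BANKS in used_banks:
            replacement = next(
                (f for f in free if f % NUM_VGPR_BANKS not in used_banks), None
            )
            if replacement is not None:
                free.remove(replacement)
                chosen = replacement
        used_banks.add(chosen % NUM_VGPR_BANKS)
        mapping[reg] = chosen
        result.append(chosen)
    return result


# v0, v4, v8 all sit in bank 0; with v1, v2, v3 free they spread across banks.
print(reassign_conflict_free([0, 4, 8], [1, 2, 3, 5]))
```

After the reassignment each of the three FMA sources sits in a different bank, which under the one-dword-per-bank-per-cycle rule would allow all three to be read in the same cycle.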
A significant motivation for the super-SIMD patent is to use the lost operand access cycles and this could entail dual-issue or faster issue latency. The odd way the GFX10 changes document latencies may be consistent with something along those lines.
What if... the L0 operand/result cache is connected directly to the HBCC …?
I'm not clear on the full purpose of it, but it's called an L0 and there's mention of an L1 as well. Possibly, there's an L2 or something else beyond. The HBCC in Vega is past all the cache layers, and since its job is paging resources into the local VRAM pool it's not specced to handle something like all the local cache output of the CUs.