AMD: Navi Speculation, Rumours and Discussion [2019]

Discussion in 'Architecture and Products' started by Kaotik, Jan 2, 2019.

  1. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,363
    Likes Received:
    3,944
    Location:
    Well within 3d
    For clarification, GCN's instruction issue hardware could issue up to 5 instructions of different types by selecting one from up to five wavefronts per cycle on the SIMD whose scheduling cycle has come up. This is not including special instructions that could be automatically consumed by the per-wavefront instruction buffers.

    I'm assuming there's parallel issue from multiple wavefronts with RDNA, although it's possible the specifics about the width or types might have changed.
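As a rough illustration of the issue rule described above (up to five instructions per cycle, each of a different type and each from a different wavefront), here is a toy arbiter. It is only a sketch: the type labels and the first-come priority order are assumptions made for the example, not AMD's actual arbitration policy.

```python
MAX_ISSUE_PER_CYCLE = 5  # "up to 5 instructions of different types" per the post


def select_issue(pending):
    """Pick instructions to issue this cycle.

    pending: list of (wave_id, instr_type), one entry per wavefront holding
    that wavefront's next instruction, listed in (assumed) priority order.
    Returns the issued (wave_id, instr_type) pairs.
    """
    issued = []
    used_types = set()
    for wave_id, instr_type in pending:
        if len(issued) == MAX_ISSUE_PER_CYCLE:
            break
        if instr_type in used_types:
            continue  # only one instruction of each type per cycle
        used_types.add(instr_type)
        issued.append((wave_id, instr_type))
    return issued


# Hypothetical snapshot of one SIMD's wavefronts; the labels are illustrative.
pending = [
    (0, "valu"), (1, "valu"), (2, "salu"), (3, "lds"),
    (4, "vmem"), (5, "export"), (6, "branch"),
]
print(select_issue(pending))
```

Wavefront 1 is skipped because a vector ALU instruction was already picked from wavefront 0, and wavefront 6 is skipped because five instructions have already been selected.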
     
  2. OlegSH

    Regular Newcomer

    Joined:
    Jan 10, 2010
    Messages:
    388
    Likes Received:
    331
    Yes, I remember this claim from the GCN docs when the architecture was first revealed. I find it highly misleading, considering that wave scheduler throughput is 1 wave per cycle and most of these instructions of different types obviously share data paths with the FP32 units; hence INT execution will block FP32 execution on the same data path, and the same goes for SFU and so on. Scalar units have separate data paths, but with 1 wave per cycle, issuing a scalar op will block a SIMD op (fp, int, sfu) from execution. Depending on the scalar core's throughput (I guess it's one instruction per cycle), there will be zero interleaving with SIMD ops. If throughput is lower, let's say 1/2 op per cycle, then there will be overlap for 1 cycle with the SIMD units.

    Don't get me wrong, GCN obviously can execute up to 4 waves concurrently with just a single scheduler per CU because of the 4-cycle cadence of wave execution on the SIMD units, but unlike Turing, it can't execute different types of instructions (different mixes of fp + int + sfu) at full rate concurrently, because they all share the same data paths and the scheduling rate is not high enough to feed separate data paths.
     
    #1122 OlegSH, Jun 30, 2019
    Last edited: Jun 30, 2019
  3. Betonmischer

    Newcomer

    Joined:
    Jun 30, 2019
    Messages:
    17
    Likes Received:
    33
    I've found this in an old GCN slide deck. It shows an independent instruction issue and decode datapath for the scalar unit. Therefore, GCN is perfectly able to execute a scalar instruction every cycle without sacrificing any of the SIMD throughput. From the information I was able to gather, it works like this: each cycle that the vector scheduler issues an instruction to one of the SIMDs, the scalar scheduler has an opportunity to issue an instruction to the scalar ALU. The caveat is that the two instructions must belong to different wavefronts chosen from the pool of up to 10 wavefronts associated with each SIMD, as GCN lacks ILP capability.

    If I am correct in my assumptions about GCN, it wouldn't make sense for RDNA to be inferior to GCN in this regard. The RDNA slides clearly show a certain degree of interleaving between SIMDs and SFUs (each cycle that an SFU instruction is issued by the scheduler, the corresponding SIMD stays idle), whereas both the scalar unit and the SIMD can be issued an instruction every cycle.

    1.jpg 2.jpg 4.png 3.png
     
    w0lfram, Lightman and Ike Turner like this.
  4. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    868
    Likes Received:
    276
    I want to share my experience regarding register allocation. The compiler currently doesn't minimize the number of allocated VGPRs (for the pre-Navi hardware I could analyze); instead it rounds the allocation up to the next power of two above the minimum. As an example, I tried to optimize a compute shader for occupancy; it was basically a single wavefront with no ability to swap it out. I statically calculated the number of registers needed myself, and it ended up needing 3 more than 32. Instead of leaving it at 35, the compiler skyrocketed to 65 (which, oddly, is considered a 64-VGPR wavefront; probably one was some sort of pseudo-VGPR eliminated below the ISA level in the real machine code). The analysis can be conducted with the tools Rys' team is publishing; e.g. some options let you see register allocation and reuse for every instruction in the ISA dump of the shader.
    I think the register allocation allowance has changed a lot for Navi. AFAIU from remarks here, power-of-two isn't necessary anymore, so the kind of thing you describe only became necessary from Navi on. Whether the older architectures could potentially be called in a similar way is an interesting question. Technically it's a scheduling optimization problem, which is easily approachable in software (unlimited complexity), but not so much in hardware. It could mean the schedulers became more powerful at solving these types of "searches", or the problem is pushed to the driver and more complex software. Ah, the eternal issue of static vs. dynamic optimization (compile-time vs. run-time).
     
  5. OlegSH

    Regular Newcomer

    Joined:
    Jan 10, 2010
    Messages:
    388
    Likes Received:
    331
    There is independent decode logic because the scalar unit has its own set of instructions (as well as caches and registers). However, scalar instructions are just instructions for a wave with uniform values. The way it's done right now is that waves with just a single work item or with uniform values are determined at compile time and converted to scalar unit instructions (memory address calculations, for example), but these still require launching a wave, and the next instruction in the wave can depend on the previous one, so you can't issue 2 instructions to 2 independent data paths for the same wave unless you are sure both instructions are independent (this in turn would require additional compiler instruction scheduling and dual-issue hardware logic). There is no evidence of superscalar instruction execution in GCN; in fact, GCN has always been known as a scalar architecture.
     
    #1125 OlegSH, Jun 30, 2019
    Last edited: Jun 30, 2019
    pharma likes this.
  6. keldor

    Newcomer

    Joined:
    Dec 22, 2011
    Messages:
    75
    Likes Received:
    113
    Remember, it's splitting a big register bank between some number of wavefronts/threads. This means the number of registers per wavefront needs to divide the number of registers in the register bank, or else there will be leftover ones that are just wasted.
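To put numbers on the leftover-register point, here is a minimal sketch. The budget of 256 VGPRs per SIMD is an assumption that is not stated in this thread (it is the commonly cited figure for GCN); the cap of 10 wavefronts per SIMD is the figure mentioned earlier in the discussion.

```python
VGPR_BUDGET = 256  # assumed per-SIMD VGPR budget for GCN, not from the thread
MAX_WAVES = 10     # "pool of up to 10 wavefronts" per SIMD, from the thread


def occupancy(vgprs_per_wave):
    """Return (resident wavefronts, registers left unused) for one SIMD."""
    waves = min(MAX_WAVES, VGPR_BUDGET // vgprs_per_wave)
    leftover = VGPR_BUDGET - waves * vgprs_per_wave
    return waves, leftover


for regs in (32, 35, 64):
    waves, leftover = occupancy(regs)
    print(f"{regs} VGPRs per wave: {waves} waves, {leftover} registers unused")
```

Under these assumptions, 32 and 64 registers divide the budget evenly, while 35 registers leaves 11 registers that no wavefront can use.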

    As for asynchronous compute, this is just another complication: the optimal number of registers for a kernel can differ depending on whether other kernels with different register counts are likely to be running on the same compute unit (I think that is what AMD calls them). Actually, something interesting to do from the driver side would be live profiling. Identify the register and cache pressure of kernels at runtime and, for later invocations, assign them to compute units accordingly, perhaps also switching to a different kernel binary tailored to specific register counts. Asynchronous compute is a performance loss if it results in thrashing caches.
     
  7. Per Lindstrom

    Newcomer Subscriber

    Joined:
    Oct 16, 2018
    Messages:
    33
    Likes Received:
    29
    Betonmischer likes this.
  8. Betonmischer

    Newcomer

    Joined:
    Jun 30, 2019
    Messages:
    17
    Likes Received:
    33
    The idea is that the SIMD and the scalar unit can each take an instruction from two separate wavefronts, as opposed to a single wavefront, which indeed would make GCN a superscalar architecture. Whether that's the case or not boils down to whether GCN has fully independent scheduling logic for the scalar unit. I'm trying to reach out to AMD for clarification.
     
    OlegSH likes this.
  9. Betonmischer

    Newcomer

    Joined:
    Jun 30, 2019
    Messages:
    17
    Likes Received:
    33
    Personally, I'm in favor of blower coolers. They're more robust mechanically compared to a lot of axial designs, if not as good in terms of acoustics.
     
    Per Lindstrom likes this.
  10. Alexko

    Veteran Subscriber

    Joined:
    Aug 31, 2009
    Messages:
    4,515
    Likes Received:
    934
    Any idea what those bolded features actually are? I don't remember reading anything about them before, curiously enough.
     
  11. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    9,087
    Likes Received:
    2,952
    Location:
    Finland
    Radeon Image Sharpening
    upload_2019-7-1_0-9-44.png
    Performance loss 0.x - 2 % or so

    FidelityFX is / will be a collection of "image improving" and possibly other features / effects

    upload_2019-7-1_0-10-47.png

    upload_2019-7-1_0-11-21.png
     
    Per Lindstrom, w0lfram and BRiT like this.
  12. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,363
    Likes Received:
    3,944
    Location:
    Well within 3d
    AMD's description has been pretty open about the 1-instruction per wavefront per cycle behavior. It is true that vector FP and vector INT do not issue concurrently, since the architecture does not consider them a separate type.
    The scalar unit's domain is separate, and the other types tend to have different domains or units handling them. Branch, GDS/export, LDS, vector memory, and special instructions round out the other general types as far as the original GCN architecture description states.
    While the data paths are not sized for an arbitrary combination of 5 instructions, there are allowances made for some concurrent issue. AMD's general architectural thrust has been to give the SIMD the best chance at good utilization, but there are signs that this includes making sure at least some of the other operation types can issue at a reasonable rate.
    For Navi, the LLVM changes note that the scalar register file is more heavily banked than the vector file. This makes sense since there are multiple types of instructions including vector ones that can source from that file.
    Prior GCN designs with the 4-cycle cadence also left the vector register file available for something other than a vector instruction at least 1/4 of the time. An FMA with three read operands in a 4-cycle cadence allows for another unit to tap into the register file. While AMD's super-SIMD patent may not apply to Navi's implementation, it notes that on average vector instructions would use the equivalent of 2 operands, meaning in many cases the register file was available for two operands to any other instruction type that might need a vector operand (export, VMEM, LDS, etc.).
    If a vector instruction sourced a scalar register (unclear if so, though the bus that allows for this is mentioned as being its own entity), then that might have been another opportunity for the register file to find concurrent use.

    A major limiting factor is operand bandwidth from a register file that is physically challenging to implement. For Turing, this is also a major consideration.
    Both architectures have 4-banked vector register files. With Navi, this banking was exposed due to the breaking of the 4-cycle cadence, and the same rule of one operand per bank per cycle is now a hazard both make software deal with.
    A significant feature Nvidia's GPUs have relied upon since Maxwell is the operand reuse cache, since it helps get around bank conflicts and provides bandwidth the register file lacks. There's a brief mention of something that might be similar for Navi in the LLVM flags, but it's unclear if it plays a similar role.

    If this is a prior GCN GPU, the restriction is that it cannot be from the same wavefront. If any of the up to 9 other wavefronts on the SIMD have a scalar operation pending it will probably be issued.
    How many candidate wavefronts Navi has to choose from on a given SIMD, and how many per cycle it can select isn't something I've seen in the presentations.
    Fully dedicating a scalar unit and scheduler to a SIMD can provide more opportunities for finding instructions without resource conflicts between the scalar and vector paths, and combining the hardware budget of two SIMD16 units should give it the ability to have more wavefront slots than before. However, the supposed streamlining of the architecture for clock speed and efficiency may put some downward pressure on the totals for a SIMD32 versus 2 SIMD16 blocks.
    Also unclear from the Navi presentation is why the branch/message block in the GCN diagram went away for Navi. Most of those instructions still exist, although the loss of some instructions may allow for a shift in what categories or relative issue rates Navi's scheduler has.

    It can make forward progress on more than just four wavefronts in a CU, although it seems the types you are focusing on are architecturally in the same bucket for GCN.
    The Navi presentation did mention that the SFUs use one issue cycle and then continue in parallel, which has some similarity with how Turing handles FP and INT instruction issue.

    It's difficult to say why the compiler bumped the allocation that high, though there are multiple programmer posts on this board and even references in AMD's GPUOpen site from developers that reference poor compiler allocation behavior for GCN.

    The hardware itself doesn't have that extreme of a granularity.
    https://gpuopen.com/amdgcn-assembly/
    https://llvm.org/docs/AMDGPUUsage.html
    The above give a VGPR granularity of 4, and 16 for SGPR for GCN3 (albeit a given shader will see its allocation doubled). It's 4 and 8 for older GCN architectures.
    From the second link, Navi's VGPR granularity goes to 8. However, there isn't really a scalar granularity anymore; there are simply 128 registers at all times.
    It has been the case that GCN's scalar register file has accumulated more and more context as more complex trap handling, complex addressing, and debugging have been added.
    The large base allocation may also have some influence as to why the scalar path was duplicated and made per-SIMD.
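For comparison, a small sketch of what the granularities quoted above would do to the 35-register shader mentioned earlier in the thread, next to the power-of-two rounding the compiler was reported to apply. The helper names are made up for this illustration; only the granularity values (4 for GCN, 8 for Navi) come from the post.

```python
def round_up(n, granularity):
    """Round n up to the next multiple of granularity."""
    return -(-n // granularity) * granularity


def next_power_of_two(n):
    """Smallest power of two that is >= n."""
    p = 1
    while p < n:
        p *= 2
    return p


VGPR_GRANULARITY = {"gcn": 4, "navi": 8}  # values quoted in the post

needed = 35  # the hand-counted requirement from the earlier example
for arch, gran in VGPR_GRANULARITY.items():
    print(arch, "hardware allocation:", round_up(needed, gran))
print("power-of-two rounding:", next_power_of_two(needed))
```

With these numbers the hardware granularity alone would cost 1 extra register on GCN and 5 on Navi, versus 29 extra for rounding to a power of two.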
     
    w0lfram and Lightman like this.
  13. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,363
    Likes Received:
    3,944
    Location:
    Well within 3d
    I would disagree about it being called superscalar if the instruction is selected from a different wavefront. Superscalar is a descriptor for architectures that can improve single-threaded execution, mostly if the hardware can analyse the instruction stream for dependences on its own.
    Picking from two separate threads removes the need for the analysis that superscalar processors perform.
     
  14. JoeJ

    Regular Newcomer

    Joined:
    Apr 1, 2018
    Messages:
    966
    Likes Received:
    1,092
    I also had such cases quite often (2-3 years ago), and this one I've kept in comments for amusement:

    #elif 0 // 21 VGPR
    lds[(((lID >> 0) << 1) | (lID & 0) | 1)] += lds[(((lID >> 0) << 1) | 0)]; BARRIER_LOCAL
    lds[(((lID >> 1) << 2) | (lID & 1) | 2)] += lds[(((lID >> 1) << 2) | 1)]; BARRIER_LOCAL
    lds[(((lID >> 2) << 3) | (lID & 3) | 4)] += lds[(((lID >> 2) << 3) | 3)]; BARRIER_LOCAL
    lds[(((lID >> 3) << 4) | (lID & 7) | 8)] += lds[(((lID >> 3) << 4) | 7)]; BARRIER_LOCAL
    lds[(((lID >> 4) << 5) | (lID & 15) | 16)] += lds[(((lID >> 4) << 5) | 15)]; BARRIER_LOCAL
    lds[(((lID >> 5) << 6) | (lID & 31) | 32)] += lds[(((lID >> 5) << 6) | 31)]; BARRIER_LOCAL
    lds[(((lID >> 6) << 7) | (lID & 63) | 64)] += lds[(((lID >> 6) << 7) | 63)]; BARRIER_LOCAL
    lds[(((lID >> 7) << 8) | (lID &127) |128)] += lds[(((lID >> 7) << 8) |127)]; BARRIER_LOCAL
    #elif 1 // 6 VGPR
    if (lID<128) lds[(((lID >> 0) << 1) | (lID & 0) | 1)] += lds[(((lID >> 0) << 1) | 0)]; BARRIER_LOCAL
    if (lID<128) lds[(((lID >> 1) << 2) | (lID & 1) | 2)] += lds[(((lID >> 1) << 2) | 1)]; BARRIER_LOCAL
    if (lID<128) lds[(((lID >> 2) << 3) | (lID & 3) | 4)] += lds[(((lID >> 2) << 3) | 3)]; BARRIER_LOCAL
    if (lID<128) lds[(((lID >> 3) << 4) | (lID & 7) | 8)] += lds[(((lID >> 3) << 4) | 7)]; BARRIER_LOCAL
    if (lID<128) lds[(((lID >> 4) << 5) | (lID & 15) | 16)] += lds[(((lID >> 4) << 5) | 15)]; BARRIER_LOCAL
    if (lID<128) lds[(((lID >> 5) << 6) | (lID & 31) | 32)] += lds[(((lID >> 5) << 6) | 31)]; BARRIER_LOCAL
    if (lID<128) lds[(((lID >> 6) << 7) | (lID & 63) | 64)] += lds[(((lID >> 6) << 7) | 63)]; BARRIER_LOCAL
    if (lID<128) lds[(((lID >> 7) << 8) | (lID &127) |128)] += lds[(((lID >> 7) << 8) |127)]; BARRIER_LOCAL
    #endif

    This is a prefix sum in a 128-wide workgroup. The first version used far too many registers for what it does; the second fixed the issue by adding useless branches, and the compiler seemed too dumb to know they must always be true :) (lID is the local thread index)
    It took me some time to find a 'fix'; writing a loop did not help, for example.
    However, this was the OpenCL compiler. I did the same in Vulkan and had no issues there. (VK was in general about 15% faster, but at the time there were no profiling tools to check register usage.)
    It seemed the OpenCL compiler did not care about the number of registers with regard to occupancy. More often than not I missed a better tier by just one damn register. Still, overall the performance on GCN was always better than anywhere else for me.
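For readers who want to check what the eight lines compute, here is a sequential Python model of the same index pattern. It is a sketch, not the original kernel: the function name is made up, and each BARRIER_LOCAL is modeled by reading from a snapshot taken before the step. From the indices, 128 threads cover 256 LDS entries.

```python
from itertools import accumulate


def scan_like_post(values, threads=128, steps=8):
    """Model the eight LDS update steps from the snippet on the CPU."""
    lds = list(values)
    for k in range(steps):
        low_mask = (1 << k) - 1
        before = list(lds)  # state at the previous barrier
        for lid in range(threads):
            block = (lid >> k) << (k + 1)
            dst = block | (lid & low_mask) | (1 << k)  # element in the upper half
            src = block | low_mask                     # last element of lower half
            lds[dst] = before[dst] + before[src]
    return lds


data = list(range(1, 257))
result = scan_like_post(data)
print(result[:8])
print(result == list(accumulate(data)))
```

Each step adds the last element of the lower half of a block to every element of the upper half, doubling the block size each time, which yields an inclusive prefix sum after eight steps.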
     
    Lightman likes this.
  15. Betonmischer

    Newcomer

    Joined:
    Jun 30, 2019
    Messages:
    17
    Likes Received:
    33
    Perhaps my wording should have been clearer on this, but I was referring to GCN being able to execute two successive instructions from a single wavefront as superscalarity. Which is obviously not the case, as two wavefronts (if the whole assumption of the vector and scalar instruction co-issue is correct) are required for concurrent execution on the SIMD and the scalar unit.
     
  16. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,363
    Likes Received:
    3,944
    Location:
    Well within 3d
    In that case I misread which item the superscalar description was being applied to.
     
    Betonmischer likes this.
  17. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    384
    Likes Received:
    389
    Slide 16 shows that not only SFU, but also regular vector (and possibly scalar) instructions are pipelined from the same wave if there is no data dependency.

    What's unclear though, is how many instructions per wave can be in flight concurrently, or in other terms how many dependencies can be tracked before a pipeline stall is enforced due to exhausted scheduling resources.
    Are you suggesting the vector register file banks in Navi really have just a single read port?
    That's quite difficult to believe, as it would result in 2 or 3 cycles per instruction just for operand gathering (at least when not using a workaround like the operand cache you mentioned), even in the best case. It also doesn't fit the advertised instruction throughput of 1 vector instruction per cycle, which requires an effective 2 or 3 reads plus a write per cycle.

    If there actually is some sort of bank conflict, it's more likely that it only costs extra if you happen to gather from more than a single bank in a single instruction, and that multiple reads from the same VGPR bank are handled in the same cycle by using multiple ports.
    Mind linking to the LLVM commit which you think referred to bank conflicts?
     
  18. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,363
    Likes Received:
    3,944
    Location:
    Well within 3d
    The LLVM changes have a speed model section that details some of the latencies. For vector operations, the latency is 5 cycles (one worse than prior generations). If tracking vector writes, there's one destination register per instruction, so up to five registers may be in flight as far as the software-visible pipeline latency is concerned.
    This is supported by AMD's slide showing instruction issue for a code sample, with Navi showing an extra cycle needed to handle dependent instructions. For the scalar to vector dependence in that slide, there was a two cycle latency.

    Yes, and I think this is stated in the LLVM changes in various places. GFX10 has an architectural feature flag stating it has a banked register file.
    There's a new set of routines used to handle register bank selection with an eye for avoiding bank conflicts.
    https://github.com/llvm-mirror/llvm...b3561b2#diff-1fe939c9865241da3fd17c066e6e0d94
    The code explicitly points out the restriction in GCNRegBankReassign.cpp:

    Code:
    /// On GFX10 registers are organized in banks. VGPRs have 4 banks assigned in
    /// a round-robin fashion: v0, v4, v8... belong to bank 0. v1, v5, v9... to
    /// bank 1, etc. SGPRs have 8 banks and allocated in pairs, so that s0:s1,
    /// s16:s17, s32:s33 are at bank 0. s2:s3, s18:s19, s34:s35 are at bank 1 etc.
    ///
    /// The shader can read one dword from each of these banks once per cycle.
    /// If an instruction has to read more register operands from the same bank
    /// an additional cycle is needed. HW attempts to pre-load registers through
    /// input operand gathering, but a stall cycle may occur if that fails. For
    /// example V_FMA_F32 V111 = V0 + V4 * V8 will need 3 cycles to read operands,
    /// potentially incuring 2 stall cycles.
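The bank rule in that comment can be sketched in a few lines. This is a simplified model, not LLVM's actual implementation: it ignores the operand pre-loading the comment mentions, and treating repeated reads of the same register as a single read is an assumption. The function names are made up for the example.

```python
from collections import Counter

VGPR_BANKS = 4  # v0, v4, v8... in bank 0; v1, v5, v9... in bank 1
SGPR_BANKS = 8  # SGPRs are allocated to banks in pairs


def vgpr_bank(reg):
    """Bank of VGPR number reg under the round-robin assignment."""
    return reg % VGPR_BANKS


def sgpr_bank(reg):
    """Bank of SGPR number reg: s0:s1 -> 0, s2:s3 -> 1, s16:s17 -> 0."""
    return (reg // 2) % SGPR_BANKS


def vgpr_read_cycles(src_regs):
    """Cycles needed to read the source VGPRs at one dword per bank per cycle.

    Assumption: the same register named twice counts as one read.
    """
    per_bank = Counter(vgpr_bank(r) for r in set(src_regs))
    return max(per_bank.values(), default=1)


# The example from the comment: V_FMA_F32 v111 = v0 + v4 * v8
cycles = vgpr_read_cycles([0, 4, 8])
print("read cycles:", cycles, "potential stall cycles:", cycles - 1)

# Spreading the operands across banks avoids the conflict in this model.
print("read cycles:", vgpr_read_cycles([0, 1, 2]))
```

In this model the conflicting FMA needs 3 read cycles (2 potential stalls), matching the figure in the quoted comment, while v0, v1, v2 can be read in a single cycle.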
    I've referenced patents for super-SIMD and a register destination cache where AMD cites a banked register file. Also notable is that this is how AMD described future and existing GPUs. This actually makes sense with the 4-cycle cadence, since a bank can be read over three cycles without worrying about conflicts. Navi breaks the issue cycle restriction of the old cadence, but does not improve the delivery rate (edit: latency) of the SRAM banks (actually a bit worse now).
    Nvidia's generally the same. I think it stopped hiding this after G80, and actually had more banks until more recent GPUs settled on the same modulo-4 assignment Navi adopted.
    For AMD, I remember descriptions of its use of banked register files going back to R600. Going that far back, AMD actually released a set of test loops for throughput that would ironically now seriously hurt throughput because they used the same register IDs over and over.
    (edit: https://forum.beyond3d.com/posts/994348/ --although it's been a long time and I may have misremembered how simple a lot of the loops were. They use few IDs, but some operations are so simple that they use the same register for all operands, and those I'd imagine Navi's hardware has no problem with.)
    One caveat about GFX10 is that while there is a mention of a register cache, its effect isn't noted for register bank selection unless it's in the nebulous "operand gathering" claim.

    That's what's stated above. For prior generations, I've speculated before that this was likely the case, and that it's actually handled in a longer pipeline ahead of actual ALU execution. There are instances cited in the ISA doc about how long it takes to effect certain instruction fetch behaviors, where there's a delay before the change propagates down to the wavefront's execution, basically because a mode change cannot stop instructions already in those preliminary pipeline stages.
    With the 4-cycle cadence, a long enough pipeline, and forwarding there's generally little software-visible difference. This sort of allocating a bank to one quarter of a wavefront is also hinted at in the super-SIMD patent as a pre-existing embodiment.

    In this case, it's the other way around. A SIMD can now freely select between 4 banks in the same cycle.
    Some possible reasons why include that AMD's super-SIMD patent blames 1/3 of a CU's power draw on the register file, and each additional port adds a transistor per cell to what is likely an array using 6T cells. These are already very large storage arrays, and also having 3-4 ports per bank is unbalancing given that it's multiplying the peak bandwidth of the register file versus the one SIMD that can use it.

    AMD neglected to mention it, and Nvidia doesn't really talk about it in the marketing either. AMD's instruction issue slide does quietly spread its operands across different banks, however.
     
    #1138 3dilettante, Jul 1, 2019
    Last edited: Jul 1, 2019
    TheAlSpark, Ext3h, w0lfram and 2 others like this.
  19. Per Lindstrom

    Newcomer Subscriber

    Joined:
    Oct 16, 2018
    Messages:
    33
    Likes Received:
    29
    No, sorry, they are new to me as well.
     
  20. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    9,087
    Likes Received:
    2,952
    Location:
    Finland
    Lightman and Per Lindstrom like this.