AMD: Navi Speculation, Rumours and Discussion [2019]

Discussion in 'Architecture and Products' started by Kaotik, Jan 2, 2019.

  1. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    Would the pipes be independent controllers that can reference a shared microcode store, or is there some kind of execution element separate from the pipes that runs the microcode and interacts with the pipes?

    Calling back to the command processor lineage that was previously discussed, all the eras of GPU architecture in the RDNA whitepaper had a command processor or its relatives working in the background.

    Some items that I came across that may not have been discussed in this thread or provide more detail on earlier discussions:

    Fixed-function resources appear to scale with shader arrays rather than shader engines. That shifts the question from the per-SE limit on CUs and RBEs to what the limit is per shader array. I've noted that GCN has always had some concept of shader arrays, with the option of one or two per shader engine. What the shader engine means architecturally isn't clear if it's no longer tied to the hardware resources it once was. The old GCN limits aren't exceeded on a per-array basis so far, and the number of shader arrays per chip hasn't gone beyond past GCN limits either. Southern Islands mentions shader arrays in its hardware IDs, and Arcturus-related changes note that Vega GPUs have one or two shader arrays per SE.
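    For reference, a rough sketch of how that hierarchy seems to break down for Navi 10 (the numbers are my own, taken from public specs rather than the whitepaper's wording):

        # Rough model of Navi 10's shader hierarchy (assumed, not official):
        # fixed-function RBEs appear to hang off each shader array.
        navi10 = {
            "shader_engines": 2,
            "arrays_per_engine": 2,
            "wgps_per_array": 5,      # 5 WGPs = 10 CUs per shader array
            "rbes_per_array": 4,      # render backends as per-array resources
        }
        arrays = navi10["shader_engines"] * navi10["arrays_per_engine"]
        cus = arrays * navi10["wgps_per_array"] * 2   # 2 CUs per WGP
        print(arrays, "shader arrays,", cus, "CUs,",
              arrays * navi10["rbes_per_array"], "RBEs")   # 4 arrays, 40 CUs, 16 RBEs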

    In that vein, the L1 is per shader array and helps limit congestion on the L2 by presenting a single client with 4 links to the L2.
    Question: prior to this, I doubt it would have been practical for GCN to have a 64+:16 crossbar between the L2 and its various clients. Would the shader array or shader engine count in prior GCN chips have played a role in determining how many links the L2 had to deal with? How much did Navi change?
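    To make the crossbar question concrete, a back-of-the-envelope count of L2 clients, assuming the graphics L1 really does collapse each array's CUs down to 4 links (the GCN side is my guess about which blocks hit the L2 directly):

        # Back-of-envelope comparison of how many request ports the L2 might see.
        # GCN guess: every CU's vector L1 talks to the L2 directly
        # (ignoring scalar/instruction caches, RBEs, etc.).
        gcn_cus = 64                      # e.g. a large GCN part like Vega 10
        gcn_l2_clients = gcn_cus          # one vector-L1 client per CU

        # RDNA, per the whitepaper: CUs miss into a per-array graphics L1,
        # and each L1 presents 4 links to the L2.
        navi10_arrays = 4
        links_per_l1 = 4
        rdna_l2_links = navi10_arrays * links_per_l1   # 16 links for 40 CUs

        print("GCN-style clients:", gcn_l2_clients)
        print("RDNA-style links: ", rdna_l2_links)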

    A few items in the CU look like merged versions of the prior SIMD hardware. Each SIMD32 supports 2x the number of wavefront buffers of a SIMD16, and the WGP supports as many workgroups as two GCN CUs.
    Some possibilities discussed earlier about the hardware: the export and messaging buses are shared between the two CUs, although I'm curious how different that is from earlier GCN--since there was arbitration for a single bus anyway.
    The instruction cache seems to have the same line size and capacity, although in a sign of the times, comparing the GCN whitepaper to the RDNA one shows this same cache supplying ~2-4 instructions now versus ~8 then.

    GCN could issue up to 5 instructions per clock to the SIMD whose turn it was, with the requirement that they be of different types and from different wavefronts.
    RDNA doubles the number of SIMDs per CU that are actively issuing instructions, but each issues up to 4 instructions per clock of different types. It's not clearly stated that they must come from different wavefronts, but I didn't see a same-wavefront example.
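    A toy way to contrast the two issue models as I read them (the categories are simplified; treat this as a sketch of the claim, not the real arbiter):

        # Sketch of the issue-width claim: GCN picks one SIMD per clock (4-clock
        # round robin) and can send it up to 5 instructions, each of a different
        # type and wavefront. RDNA has both SIMD32s issuing every clock, up to 4
        # instructions of different types each.
        GCN  = {"simds_issuing_per_clk": 1, "max_issue_per_simd": 5}
        RDNA = {"simds_issuing_per_clk": 2, "max_issue_per_simd": 4}

        for name, m in (("GCN", GCN), ("RDNA", RDNA)):
            peak = m["simds_issuing_per_clk"] * m["max_issue_per_simd"]
            print(f"{name}: up to {peak} instructions issued per CU per clock (mixed types)")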

    The L1 is read-only, so I'm not sure at this point how many write paths there are to the L2, though this wasn't explicitly stated in prior GCN ISA guides either. It was clarified that the L1 can supply up to 4 accesses per clock.

    Oldest-workgroup scheduling and clauses do seem to point to a desire to moderate GCN's potentially overly thrash-happy thread switching in certain situations.

    128 byte cache lines do have some resemblance to some competing GPUs, although there may be differing levels of commitment to that granularity at different levels of cache.
    Wave32 hardware and dropping the 4-cycle cadence have brought the SIMD and SFU model closer to some Nvidia implementations as well.

    RDNA has groups of 4 L2 slices linked to a single 64-bit memory controller--which for GDDR6 is 4 16-bit channels (another of the "did RDNA change something" items).
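    The slice-to-channel arithmetic, just to spell it out (Navi 10's 256-bit bus assumed from public specs):

        # One 64-bit GDDR6 controller = 4 x 16-bit channels; 4 L2 slices per
        # controller, per the whitepaper.
        bus_width_bits = 256
        controller_width_bits = 64
        l2_slices_per_controller = 4

        controllers = bus_width_bits // controller_width_bits          # 4
        gddr6_channels = controllers * (controller_width_bits // 16)   # 16 channels
        l2_slices = controllers * l2_slices_per_controller             # 16 slices

        print(controllers, "controllers,", gddr6_channels, "channels,", l2_slices, "L2 slices")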

    The driver and LLVM code changes reference the DSBR and primitive shaders. There's GPU profiling showing primitive shaders running.
    The RDNA ISA doc points out primitive shaders specifically--and not in an accidental way like a few references in the Vega ISA doc that AMD failed to erase.
    The triangle references seem to be what AMD has done for the fixed-function pipeline irrespective of whether primitive shaders are running.

    What we don't have a clear picture on is how consistently these are working, or how successful they are versus having them off. The DSBR generally had modest benefits, and if it's generally unchanged it might not be newsworthy. If primitive shaders are not considered fully baked, or are also of limited impact, it's possibly not newsworthy or may dredge up the memory of the failed execution with Vega.

    The whitepaper goes into some detail of what the primitive units and rasterizers are responsible for, though that central geometry processor's specific duties aren't addressed.

    The Vega whitepaper gave a theoretical max that probably represented the best-case with a primitive shader, which Vega wound up seeing none of. The RDNA whitepaper may be going the other way and gives what the triangle setup hardware can do as a baseline. I'm a little unclear on whether it can cull two triangles and submit a triangle to the rasterizer in one clock, or if it's a more complex mixture of culling and/or submitting.

    One of the justifications for primitive shaders with Vega was avoiding losing a cycle per triangle that reached the fixed-function pipeline only to be culled. Enhancing the primitive pipeline may somewhat reduce the impact of this argument, though it might not go as far as Nvidia does with task and mesh shaders, which tend to assume the triangle setup hardware is capable of a fair amount of culling on its own.
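    A quick illustration of why the "lost cycle per culled triangle" argument matters less once the fixed-function hardware can cull faster than it can submit (the rates below are placeholders, not claims about Navi):

        # Toy effective-throughput calculation: how many triangles per clock reach
        # the rasterizer when culled triangles do or don't cost a setup cycle.
        cull_fraction = 0.5        # half the incoming triangles end up culled

        # Old-style: every triangle, culled or not, takes one setup cycle.
        tris_submitted_old = 1.0 * (1.0 - cull_fraction)                 # 0.5/clk

        # New-style claim: culling runs at a higher rate than submission, so
        # culled triangles no longer eat whole submission cycles.
        cull_rate, submit_rate = 2.0, 1.0                                # per clock
        tris_submitted_new = min(submit_rate, cull_rate * (1.0 - cull_fraction))  # 1.0/clk

        print(tris_submitted_old, "vs", tris_submitted_new)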
     
    anexanhume, AlBran, Malo and 3 others like this.
  2. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,426
    Likes Received:
    421
    Location:
    New York
    Thanks for sharing.

    I don’t get the emphasis on “dual” compute unit. There are 4 32-wide SIMDs per CU each with their own scheduler, registers and wavefronts. All 4 SIMDs share the LDS and caches.

    So what exactly is dual meant to describe? Is it just that each pair of SIMDs in the CU share a TMU block and there are 2 such blocks?
     
    #1382 trinibwoy, Aug 22, 2019
    Last edited: Aug 22, 2019
  3. techuse

    Newcomer

    Joined:
    Feb 19, 2013
    Messages:
    40
    Likes Received:
    14
    Isn't Turing/Pascal rasterization rate 0.5 per GPC per clock? How likely is this to be a bottleneck in...
    I thought it was describing the optional wave sizes, but someone more knowledgeable than me should definitely chime in.
     
  4. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,426
    Likes Received:
    421
    Location:
    New York
    My understanding of the scope of a CU/SM in recent architectures is that it’s the unit of hardware that owns the execution of a workgroup/block of threads and has its own pool of LDS/shared memory.

    AMD is counting a dual compute unit as 2 CUs even though all 4 SIMDs appear to share the LDS. I must be missing something.
     
  5. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    1,319
    Likes Received:
    23
    Location:
    msk.ru/spb.ru
    [attached image: excerpt from the RDNA Shader ISA document]
    https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_7July2019.pdf
     
    trinibwoy and BRiT like this.
  6. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,426
    Likes Received:
    421
    Location:
    New York
  7. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    From the RDNA ISA doc, the LDS itself is implemented as two 64 KB halves, and one half is considered local to one CU. The arrangement of each local half matches a GCN CU's LDS capacity and banking, so the larger LDS listed for the WGP is really two "classic" LDS arrays. The big difference is that there is some kind of link between the two, so a wavefront in one CU can read from the more distant half in WGP mode--subject to potential performance penalties that aren't otherwise specified.
    From my reading of the ISA doc, WGP mode allows a workgroup's LDS allocation to be split across the two halves without changing the maximum allocation a workgroup can make.
    At some point there may be some disclosure of why two GFX10 variants have an LLVM bug flag for LDS usage in workgroup mode, and how significant that is; for some reason one variant does not have that flag.

    Aside from that link between the LDS halves, much of the CU layout the RDNA whitepaper goes into looks a lot like two independent CUs.
    There are some differences, like the dual-CU supporting twice as many workgroups as a single GCN CU, though I'm not clear whether that means there's a shared scheduling component tracking twice as many workgroups, or two CUs that each hold half the total in local hardware and can somehow query status in the other CU. I'm also unclear whether WGP versus CU mode affects this ceiling from the point of view of a CU.
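    A small sketch of how I read the LDS arrangement (sizes from the ISA doc; the "near/far" framing is mine):

        # Each WGP has two 64 KB LDS arrays, one local ("near") to each CU, with a
        # link between them so WGP-mode wavefronts can also reach the "far" half.
        LDS_HALF_KB = 64                       # matches a GCN CU's LDS capacity/banking

        wgp_lds_kb = 2 * LDS_HALF_KB           # 128 KB of LDS listed for the WGP
        max_alloc_per_workgroup_kb = 64        # unchanged by WGP mode, per the ISA doc;
                                               # WGP mode only lets the allocation span both halves

        print(wgp_lds_kb, "KB per WGP,", max_alloc_per_workgroup_kb, "KB max per workgroup")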
     
    trinibwoy likes this.
  8. Shaklee3

    Newcomer

    Joined:
    Apr 9, 2016
    Messages:
    18
    Likes Received:
    10
    Which one? I haven't seen any hyperscalers even mention they have a working cluster. Others like Summit are very open with their designs and use cases. Regardless, if only hyperscalers see it, that likely means it's in short supply.
     
  9. del42sa

    Newcomer

    Joined:
    Jun 29, 2017
    Messages:
    166
    Likes Received:
    82
    Is there any info from Hot Chips about RDNA?
     
  10. anexanhume

    Veteran Regular

    Joined:
    Dec 5, 2011
    Messages:
    1,493
    Likes Received:
    676
    They released the white paper and the talk will get posted online within the next few months.
     
  11. Digidi

    Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    224
    Likes Received:
    96
    My understanding of primitive shaders is this:
    It's like workload balancing. If not enough pixels reach the CUs because not enough triangles are being culled and the CUs sit empty, why not give the CUs culling work?

    To me it now looks like primitive shader culling is its own fixed shader stage.
     
  12. bridgman

    Newcomer Subscriber

    Joined:
    Dec 1, 2007
    Messages:
    58
    Likes Received:
    102
    Location:
    Toronto-ish
    I don't think you are missing anything. CUs have had 64 ALUs since Cayman (VLIW4, 16-wide SIMD), although all the GCN ones were organized as 4 16-wide SIMDs with a separate scalar ALU, so arguably 65.

    When we doubled the size of each SIMD we could either say "hey, CUs are twice as big now" (which would be confusing) or we could talk about 2-CU blocks (which was felt to be a bit less confusing).

    My understanding was that we still wanted to allow 4 SIMDs to collaborate via LDS in order to minimize the impact on existing code, but didn't want to confuse customers by making CUs twice as big, so the remaining option was to keep a CU at 64 ALUs (66 now, I guess), going from 4 SIMDs to 2, with LDS sharing between 2 CUs to maintain "4 SIMDs per LDS".
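    A rough way to picture the two ways of counting (numbers from the description above; the layout itself is just a sketch, not official terminology):

        # The same lanes counted different ways: a CU stays at 64 vector ALUs,
        # while the WGP keeps "4 SIMDs per LDS".
        layouts = {
            "GCN CU":   {"simds": 4, "lanes_per_simd": 16, "scalar_alus": 1},  # 64 + 1
            "RDNA CU":  {"simds": 2, "lanes_per_simd": 32, "scalar_alus": 2},  # 64 + 2
            "RDNA WGP": {"simds": 4, "lanes_per_simd": 32, "scalar_alus": 4},  # 128 + 4, one shared LDS
        }
        for name, l in layouts.items():
            print(name, "=", l["simds"] * l["lanes_per_simd"], "vector lanes,",
                  l["scalar_alus"], "scalar ALU(s),", l["simds"], "SIMDs")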

    Not sure if that helps or just trowels on another layer of confusion :)
     
    Kej, Lightman, trinibwoy and 3 others like this.
  13. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,426
    Likes Received:
    421
    Location:
    New York
    Yeah I definitely read it as CUs being twice as big on the first pass through the paper.

    I am actually a bit more confused now :) In what way could GCN code be optimized for the number of SIMDs per LDS that wouldn’t map well to RDNA? Is it due to changes in LDS bandwidth / latency with fewer SIMDs?
     
  14. JoeJ

    Regular Newcomer

    Joined:
    Apr 1, 2018
    Messages:
    417
    Likes Received:
    484
    Any ideas how compute performance could differ from GCN?
    Reading the papers I only see potential improvements, but the few compute benchmarks show big differences in both directions.

    My confusion: VGPRs are not twice as much now, yet in a way register pressure problems are magically gone now, no?

    The whitepaper lacks info on what affects occupancy now, and whether there are serious changes from GCN behavior here. (I assume not, but I did not really understand the sections that address occupancy.)

    A sort-of killer feature would be the option to double the accessible LDS for certain workgroups while running others that don't use any LDS on the other half of the WGP. If technically possible, it would be worth some work to make it happen!
     
  15. Bondrewd

    Regular Newcomer

    Joined:
    Sep 16, 2017
    Messages:
    513
    Likes Received:
    234
    The software.
     
  16. bridgman

    Newcomer Subscriber

    Joined:
    Dec 1, 2007
    Messages:
    58
    Likes Received:
    102
    Location:
    Toronto-ish
    Sorry, I think I was a bit short of coffee on the last post... you get backwards compatibility anyway, but you aren't taking full advantage of the RDNA hardware because each workgroup is limited to a single CU. If you run in WGP mode you can spread a workgroup's waves across both CUs and the waves are still able to communicate via LDS, etc.
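    My attempt to restate that as a sketch (the mode names match the ISA doc; the numbers are just the SIMD counts involved, purely illustrative):

        # Where a workgroup's waves can land in the two modes.
        # CU mode: all waves of a workgroup stay on one CU (2 SIMDs), as with GCN code.
        # WGP mode: waves can spread across both CUs (4 SIMDs) and still share LDS.
        def simds_available_to_workgroup(mode):
            return {"CU": 2, "WGP": 4}[mode]

        for mode in ("CU", "WGP"):
            print(f"{mode} mode: a workgroup's waves can spread over "
                  f"{simds_available_to_workgroup(mode)} SIMDs")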

    Coming up with good names for things is harder than it looks :)
     
    Kej, ethernity, w0lfram and 7 others like this.
  17. Silent_Buddha

    Legend

    Joined:
    Mar 13, 2007
    Messages:
    16,042
    Likes Received:
    4,993
    Get some children in a room and ask them to come up with names. Children are GREAT at coming up with random names for things. :)

    Regards,
    SB
     
    Kej, ethernity, w0lfram and 5 others like this.
  18. del42sa

    Newcomer

    Joined:
    Jun 29, 2017
    Messages:
    166
    Likes Received:
    82
    So in other words, Vega PS was just a marketing gimmick based on unrealistic numbers..... :wink4: Anyway, it will be interesting to watch how RDNA2 evolves.
     
  19. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,519
    Likes Received:
    852
    It is not magic, it is an effect of the shorter execution latency. Each instruction result has a reservation in the register file. Since latency in RDNA is one quarter of GCN's, register file entries used for temporary results effectively have one quarter the footprint (measured in bytes times cycles).

    For code with lots of short-lived temporaries the register file is virtually quadrupled (upper bound, grain of salt, mileage may vary, etc...)
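    A worked example of the bytes-times-cycles argument (the lifetimes are made up; only the 4:1 latency ratio comes from the two architectures):

        # Toy numbers: a temporary that is produced and consumed by the very next
        # dependent instruction holds its register for roughly one issue-to-use gap.
        # Quarter the gap, quarter the byte*cycle footprint.
        bytes_per_vgpr_per_lane = 4        # one 32-bit register, per lane

        gcn_gap_cycles  = 4                # earliest dependent issue (4-cycle cadence)
        rdna_gap_cycles = 1                # single-cycle cadence claimed for wave32

        def footprint(byte_count, cycles_held):
            # how long a register stays reserved, weighted by its size
            return byte_count * cycles_held

        print("GCN  short-lived temp:", footprint(bytes_per_vgpr_per_lane, gcn_gap_cycles), "byte-cycles/lane")
        print("RDNA short-lived temp:", footprint(bytes_per_vgpr_per_lane, rdna_gap_cycles), "byte-cycles/lane")
        # Long-lived values keep their registers regardless of latency, so the
        # "virtually quadrupled" register file is an upper bound for temp-heavy code.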

    Cheers
     
    w0lfram, Lightman, BRiT and 2 others like this.
  20. JoeJ

    Regular Newcomer

    Joined:
    Apr 1, 2018
    Messages:
    417
    Likes Received:
    484
    I see, so the reduced instruction latency also reduces the required registers. I did not realize this before, thanks :)
     