AMD: Navi Speculation, Rumours and Discussion [2019-2020]

Discussion in 'Architecture and Products' started by Kaotik, Jan 2, 2019.

Thread Status:
Not open for further replies.
  1. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    3,242
    Likes Received:
    3,405
    Top end card could already be 72 CU with a cut down one being 64.
     
  2. Frenetic Pony

    Regular

    Joined:
    Nov 12, 2011
    Messages:
    807
    Likes Received:
    478
With only 3 dies I'd expect a fairly large discrepancy. It's no different from Ampere; the 3080 should be between 50-70% faster than a 3070.

The bizarre claims about the bus width make even less sense now, though. If there's some magic eDRAM for "big Navi", why wouldn't it be used for the smaller dies as well, just scaled down accordingly?
     
  3. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    I think your criticism is on-point. AMD piggy-backs on research by people who normally work on CUDA.

    This also sounds like "GCN" to me, though the "sizes" remind me of NVidia.

I guess this simply reflects the compression ratios: render target pixels consume far more cache space than compressed texels, so L1 is biased towards texels.

    A tenet of RDNA is that L2s "only talk to L1s" (though L2s would also need to support clients such as the command processor and media units like h.264 decode). This means the only way the ROPs can see memory is via L1.

The whitepaper doesn't talk about writes to memory originating from compute kernels (unless I've missed/forgotten it), and these would have the same problem. Effectively, it seems, all writes from shader engines are L1-write-through-L2.

If L1 lines are "address-locked" (no replication anywhere in the GPU at L1 level), then it would seem that compute kernels that scatter, or that do a lot of random read-write to memory, will benefit as L1 effectively grows in size, even if the latencies are more variable than they would be with replicated L1 lines. For these kernels, replicated L1 lines would, if they were re-used at all, tend to suffer from invalidations, i.e. the use:invalidation ratio would be "low".
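As a toy illustration of that effective-capacity argument (a hand-rolled model, not AMD's actual policy or sizes): with address-locked lines each address has exactly one home slice, so the slices aggregate into one larger cache, whereas replicated slices each contend for the whole address space.

```python
# Toy model: compare a replicated L1 (each slice may cache any address)
# against an address-interleaved L1 (each address is owned by exactly one
# slice, so capacity aggregates across slices) on a random scatter stream.
import random
from collections import OrderedDict

class LRUCache:
    def __init__(self, lines):
        self.lines = lines
        self.store = OrderedDict()
        self.hits = self.misses = 0

    def access(self, addr):
        if addr in self.store:
            self.store.move_to_end(addr)   # refresh LRU position
            self.hits += 1
        else:
            self.misses += 1
            self.store[addr] = True
            if len(self.store) > self.lines:
                self.store.popitem(last=False)   # evict LRU line

def scatter_hit_rate(n_slices, lines_per_slice, addrs, interleaved):
    slices = [LRUCache(lines_per_slice) for _ in range(n_slices)]
    for i, addr in enumerate(addrs):
        if interleaved:
            s = slices[addr % n_slices]   # address-locked: one owner slice
        else:
            s = slices[i % n_slices]      # replicated: requests spread round-robin
        s.access(addr)
    return sum(s.hits for s in slices) / len(addrs)

random.seed(0)
# Random read-write stream touching 4x the capacity of a single slice.
stream = [random.randrange(1024) for _ in range(100_000)]
r = scatter_hit_rate(4, 256, stream, interleaved=False)
a = scatter_hit_rate(4, 256, stream, interleaved=True)
print(f"replicated: {r:.2f}, interleaved: {a:.2f}")
```

With the working set at 4x a single slice's capacity, the interleaved arrangement holds the whole set (only compulsory misses remain), while each replicated slice keeps thrashing.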
     
  4. pTmdfx

    Regular

    Joined:
    May 27, 2014
    Messages:
    416
    Likes Received:
    379
    GL1 is bypassed for all CU atomic and write requests (ref: ISA documentation), and any affected GL1 line is invalidated (ref: whitepaper). So GL1 lines are brought in only by non-coherent read misses, and coherent ones would skip GL1 entirely.

Read-after-write performance does not seem like a primary design goal of the GCN/RDNA cache hierarchy after all. For example, writes can stay in L0 only when they dirty an entire cache line. So scattered reads and writes effectively end up being served by L2, even for a single wavefront, or for multiple wavefronts in the same workgroup (barrier synchronised).

Btw, these are the only two legit read-after-write scenarios that I know of, since the only sane sync primitive outside the workgroup is atomics (aside from graphics-related ones), which always skip GL1 for L2. Write combining has always been the focus, and the only paper I read that echoes this focus was QuickRelease (2014).

    GL1 (aside from its L2 request arbitrator role) does seem to be mostly amplifying bandwidth for constant, texture and instruction loads.
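The policy described above can be sketched as a toy model (the real GL1 is of course far more involved; this just encodes the allocate-on-read-miss, bypass-and-invalidate-on-write behaviour inferred from the ISA doc and whitepaper):

```python
# Minimal sketch of the described GL1 policy: non-coherent read misses
# allocate in GL1; coherent reads skip it; CU writes and atomics bypass it
# (no allocate) and invalidate any matching line so later reads can't
# observe stale data.
class GL1:
    def __init__(self):
        self.lines = {}   # addr -> value cached from L2
        self.l2 = {}      # dict standing in for the L2 backing store

    def read(self, addr, coherent=False):
        if coherent:
            return self.l2.get(addr, 0)      # coherent reads skip GL1 entirely
        if addr not in self.lines:           # miss: allocate in GL1
            self.lines[addr] = self.l2.get(addr, 0)
        return self.lines[addr]

    def write(self, addr, value):
        self.l2[addr] = value                # write goes straight to L2
        self.lines.pop(addr, None)           # no-allocate + invalidate in GL1

gl1 = GL1()
gl1.write(0x40, 1)
assert gl1.read(0x40) == 1   # miss, fetched from L2, now cached in GL1
gl1.write(0x40, 2)           # bypasses GL1, invalidates the cached line
assert gl1.read(0x40) == 2   # re-fetches the new value, not stale data
```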
     
    #3264 pTmdfx, Sep 23, 2020
    Last edited: Sep 23, 2020
    Lightman, Jawed and Ext3h like this.
  5. techuse

    Veteran

    Joined:
    Feb 19, 2013
    Messages:
    1,426
    Likes Received:
    909
    Did you mean 3060?
     
  6. Frenetic Pony

    Regular

    Joined:
    Nov 12, 2011
    Messages:
    807
    Likes Received:
    478
    Nope

The 3080 has 70% more bandwidth than a 3070, which is stuck at 2080 levels, and the card is very much bandwidth limited in certain titles like Flight Simulator. The 50% should be for titles that aren't like Doom Eternal.
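For reference, the 70% figure follows directly from the launch memory specs (RTX 3080: 320-bit GDDR6X at 19 Gbps; RTX 3070: 256-bit GDDR6 at 14 Gbps, same as the 2080):

```python
# Quick arithmetic check of the bandwidth ratio from public launch specs.
def bandwidth_gbs(bus_bits, gbps_per_pin):
    return bus_bits * gbps_per_pin / 8   # bits -> bytes, result in GB/s

bw_3080 = bandwidth_gbs(320, 19)   # 760 GB/s
bw_3070 = bandwidth_gbs(256, 14)   # 448 GB/s
print(f"3080: {bw_3080:.0f} GB/s, 3070: {bw_3070:.0f} GB/s, "
      f"ratio: {bw_3080 / bw_3070:.2f}x")
```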

And I do wonder if the spread on the Navi cards will be quite similar. Depending on how fast the big one is, it could well be worse, if the big one is faster than a 3090 while the midrange is stuck somewhere near an overclocked 5700 XT.
     
  7. techuse

    Veteran

    Joined:
    Feb 19, 2013
    Messages:
    1,426
    Likes Received:
    909
Nvidia claimed the 3070 is faster than a 2080 Ti. I can't see a 3080 being 50-70% faster.
     
  8. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,022
    Likes Received:
    122
Yes and no. Verbally, yes, it was claimed to be faster. But on Nvidia's own presentation slide the 3070 was actually "only" exactly as fast as the 2080 Ti. My interpretation would be that you have to include quite a few titles which use RTX to get the same average performance as the 2080 Ti...
     
  9. techuse

    Veteran

    Joined:
    Feb 19, 2013
    Messages:
    1,426
    Likes Received:
    909
I don't doubt the 2080 Ti being faster in various scenarios. It's highly unlikely the 3070 ends up around 2080/2080 Super performance level, which is where it would need to fall for the 3080 to be 50-70% faster.
     
    DegustatoR likes this.
  10. xEx

    xEx
    Veteran

    Joined:
    Feb 2, 2012
    Messages:
    1,060
    Likes Received:
    543
Idk if it was intentional or not... but at 8:50 you can see Big Navi on screen.

     
  11. eastmen

    Legend Subscriber

    Joined:
    Mar 17, 2008
    Messages:
    13,878
    Likes Received:
    4,724
Sure it's not the Radeon VII?
     
  12. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    10,244
    Likes Received:
    4,465
    Location:
    Finland


    Galax puts 3070 slower than 2080 Ti, but what does any of this have to do with Navi?
     
    Lightman, PSman1700 and DegustatoR like this.
  13. Leoneazzurro5

    Regular

    Joined:
    Aug 18, 2020
    Messages:
    335
    Likes Received:
    348
The mystery deepens (lol if true)

     
    Cyan, Jawed and Krteq like this.
  14. Looks like Vega VII to me.

    I don't know what happened there but the tweet is this:
     
    Lightman and DavidGraham like this.
  15. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    10,244
    Likes Received:
    4,465
    Location:
    Finland
    Yep, that's Radeon VII for sure
     
  16. LordEC911

    Regular

    Joined:
    Nov 25, 2007
    Messages:
    877
    Likes Received:
    208
    Location:
    'Zona
If by overclocked 5700 XT you mean ~20% faster, aka RTX 2080/Super levels, then sure.
     
    PSman1700 likes this.
  17. Leoneazzurro5

    Regular

    Joined:
    Aug 18, 2020
    Messages:
    335
    Likes Received:
    348
I am sorry for the post, but I don't know what happened (and at work I cannot see tweets, so I could not see what happened to the post itself...)
     
  18. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    At the level of L1s with per-channel L2 slices, the architectures generally agree. However, I may have mis-remembered which GPU the paper's analysis resembled more. I think the 2x16 SIMD arrangement, register file, separate texture caches, and 16KB/48KB L1/scratchpad split might be closer to Fermi.

The L1 is described as being read-only, and elsewhere was presented as only holding data decompressed from DCC. The RDNA whitepaper indicated the L1 would absorb a lot of their traffic, but how much can it absorb given that so much of the ROP traffic bypasses it?

Another presentation had the DCC hardware absorbing writes from the CUs and RBEs, with the RBEs writing to the compressor, which is linked to the L2 since the L1 is uncompressed.

This would seemingly benefit the RBEs, but the major limiters described for them are related to write-heavy traffic. Is there still significant bandwidth amplification for ROP traffic with a read-only L1?
     
    PSman1700 likes this.
  19. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    946
    Likes Received:
    413
    UAV writes maintaining DCC are now supported, so it's not only the RBEs.
     
  20. pTmdfx

    Regular

    Joined:
    May 27, 2014
    Messages:
    416
    Likes Received:
    379
In theory it could act as a victim cache for the RBE caches, exploiting the temporal & spatial locality of exports. This seems possible especially since each RBE owns its own screen-space partition, and therefore cache coherency for render targets across shader engines is presumably a non-issue.

This would however assume that the GL1 cache controller supports cache policies exclusive to the RBEs, so that dirty lines from the RBE caches can stay in GL1 (write allocate) while they are on their way to L2. Nothing like this exists on the CU side anyway; the ISA makes it quite clear that all atomics and writes bypass GL1 (write no-allocate + invalidation).
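That hypothetical per-client policy can be sketched as follows (to be clear, the RBE write-allocate path is speculation; only the CU bypass-plus-invalidate behaviour is documented):

```python
# Hypothetical per-client GL1 write policy: RBE exports write-allocate into
# GL1 on their way to L2 (victim-cache idea, speculative), while CU writes
# keep the documented no-allocate + invalidate policy.
class GL1:
    def __init__(self):
        self.lines = {}   # addr -> value cached in GL1
        self.l2 = {}      # dict standing in for the L2 backing store

    def write(self, addr, value, client):
        self.l2[addr] = value
        if client == "RBE":
            self.lines[addr] = value      # write-allocate: keep export data hot
        else:   # CU writes/atomics: no-allocate + invalidate (per the ISA doc)
            self.lines.pop(addr, None)

    def read(self, addr):
        if addr in self.lines:
            return self.lines[addr], "GL1 hit"
        self.lines[addr] = self.l2.get(addr, 0)
        return self.lines[addr], "L2 fill"

gl1 = GL1()
gl1.write(0x100, 7, client="RBE")
print(gl1.read(0x100))   # blend/readback of a recent export hits GL1
gl1.write(0x200, 9, client="CU")
print(gl1.read(0x200))   # CU write bypassed GL1, so this fills from L2
```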
     
    #3280 pTmdfx, Sep 23, 2020
    Last edited: Sep 23, 2020
    PSman1700 likes this.