AMD: Navi Speculation, Rumours and Discussion [2019-2020]

Discussion in 'Architecture and Products' started by Kaotik, Jan 2, 2019.

Thread Status:
Not open for further replies.
  1. SimBy

    SimBy Regular

    Jesus H Christ. I expected at least 2.2 like PS5, but 2.5GHz?
     
    Lightman likes this.
  2. BRiT

    BRiT (>• •)>⌐■-■ (⌐■-■) Moderator Legend Alpha

    Still haven't seen the cooling required for either one, so will have to wait a while more to see how it really plays out.
     
  3. PSman1700

    PSman1700 Legend

    80 CUs @ 2.2GHz, damn. Nvidia has some serious competition then, well over 20TF I assume.
     
    Lightman and Krteq like this.
  4. yuri

    yuri Regular

    There is a guy digging through AMD's firmware blobs. He got some nice info from the descriptions and power tables.

    * Navi 21 - 80CUs, maxing out at 2.05-2.2GHz = 21-22.5TFLOPS
    * Navi 22 - 40CUs, maxing out at a whopping 2.5GHz = 12.8TFLOPS
    * Navi 23 - 32CUs, no frequency from the power tables yet; apparently this one comes later, after 21 and 22.
    * Navi 31 - 80CUs with identical configuration as Navi 21; possibly just placeholder data?
    * Van Gogh APU - 1 SA with 8 RDNA CUs = 8 CUs
    * Rembrandt APU - 2 SAs with 6 RDNA CUs each = 12 CUs
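
    The TFLOPS figures in the list can be reproduced with a back-of-the-envelope calculation. This is only a sketch: it assumes 64 FP32 lanes per RDNA CU and 2 FLOPs per lane per clock (one fused multiply-add), which the post does not state. The function name and constants are made up for illustration.

```python
# Assumptions (not from the firmware data): 64 FP32 lanes per CU,
# 2 FLOPs per lane per clock (one fused multiply-add).
LANES_PER_CU = 64
FLOPS_PER_LANE_PER_CLOCK = 2


def peak_tflops(cus: int, clock_ghz: float) -> float:
    """Theoretical peak FP32 throughput in TFLOPS for a given CU count and clock."""
    gflops = cus * LANES_PER_CU * FLOPS_PER_LANE_PER_CLOCK * clock_ghz
    return gflops / 1000.0


navi21_low = peak_tflops(80, 2.05)
navi21_high = peak_tflops(80, 2.2)
navi22 = peak_tflops(40, 2.5)

print(f"Navi 21: {navi21_low:.1f}-{navi21_high:.1f} TFLOPS")
print(f"Navi 22: {navi22:.1f} TFLOPS")
```

    These are peak numbers at the maximum clock in the tables, not sustained throughput.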



    // ok, beaten :]
     
    Lightman, Krteq and PSman1700 like this.
  5. tunafish

    tunafish Regular

    If I understand things correctly, the card will basically never actually run at 2.5GHz. That's the absolute maximum boost limit for when all stars align and power consumption is low and thermals are excellent. That number is not marketed or printed on any box, mainly because realistically no card ever reaches it in any real load. Best possible real-world clocks have typically been ~100MHz below that.

    OFC, 2.4GHz is still pretty monstrous...
     
    Alexko and Lightman like this.
  6. PSman1700

    PSman1700 Legend

    A Navi 21/31 looks very tempting: 22+ TF, perhaps some OC on top of that, and some lightning-fast HBM.
     
    Lightman likes this.
  7. SimBy

    SimBy Regular

    While the clocks are high, memory bandwidth still makes absolutely no sense to me.
     
    xpea likes this.
  8. pjbliverpool

    pjbliverpool B3D Scallywag Legend

    If those specs are true then Navi 22 is pretty disappointing. I'd have hoped the second-tier GPU would be a comfortable step above the XSX, but this would be barely faster at all.

    EDIT: missed the memory bandwidth, so it would actually be slower than the XSX :/
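
    To put a rough number on "barely faster": the sketch below compares peak FP32 throughput only. The XSX configuration (52 CUs at 1.825GHz) comes from Microsoft's published specs, not from this thread, and the 64 lanes per CU and 2 FLOPs per clock are assumptions. Memory bandwidth is left out because the thread gives no bandwidth figure for Navi 22.

```python
def peak_tflops(cus, clock_ghz, lanes_per_cu=64, flops_per_lane=2):
    """Theoretical peak FP32 TFLOPS; lane count and FLOPs per lane are assumptions."""
    return cus * lanes_per_cu * flops_per_lane * clock_ghz / 1000.0


xsx = peak_tflops(52, 1.825)   # Xbox Series X, published configuration
navi22 = peak_tflops(40, 2.5)  # rumoured Navi 22 figures from the thread

# Relative compute advantage of Navi 22 over the XSX, in percent.
advantage_pct = (navi22 / xsx - 1.0) * 100.0

print(f"XSX: {xsx:.2f} TF, Navi 22: {navi22:.2f} TF, delta: {advantage_pct:.1f}%")
```

    On paper that is a single-digit percentage gap, and only if Navi 22 actually holds its maximum clock.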
     
    PSman1700 and BRiT like this.
  9. SimBy

    SimBy Regular

    I don't know. That 40-80 gap seems way too big. Something has to slot in there.
     
  10. I'd assume this will be a cut-down Navi 22. There will be a big empty desert between 40CU 192-bit and 80CU 384-bit or whatever. I expect to see 256/320-bit full/cut-down versions of Navi 22 and Navi 21
     
    PSman1700, Krteq and pjbliverpool like this.
  11. Jay

    Jay Veteran

    If anything, my takeaway is that memory efficiency has improved, which would make XSS look a bit better.

    But yeah, if that bandwidth is true it's very surprising.
    I guess their view is it's not targeting 4K?
     
  12. Leoneazzurro5

    Leoneazzurro5 Regular

    Well, if you look at the power tables for Navi 10, the power and clocks are those for the 5700 (base clocks, moreover), not even the 5700XT, and the 5700 reaches these frequencies for sure.
     
    Lightman and gamervivek like this.
  13. eastmen

    eastmen Legend Subscriber

    Navi 21 at 3080 performance and prices would be a good grab and a good change of pace for AMD. If the information above is correct, we could see such a thing happen.
     
  14. SpaceBeer

    SpaceBeer Newcomer

    Agreed. But if Navi 22, with improved clocks and IPC, is ~20-25% faster than the 5700XT, it would be at 2080 Super level. So a cut-down Navi 21 would be at 2080 Ti / 3070 level, and that makes sense.
     
  15. arandomguy

    arandomguy Regular Newcomer

    If Navi 22 does have an inherent clock advantage over Navi 21 by the amount listed then the gap is really roughly the same as that of GA104 against GA102.

    There is no full consumer GA104 or GA102 but RTX 3070/RTX 3090 TF ratio is 57.1%.

    Using the 12.8/22.5 numbers for Navi 22/21, the ratio is ~56.9%.
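
    The two ratios can be checked as follows. This is a sketch, not an official comparison: the RTX 3070 and RTX 3090 shader counts and boost clocks are Nvidia's published figures rather than anything stated in this thread, and the helper name is made up.

```python
def boost_tflops(shaders: int, boost_ghz: float) -> float:
    """Peak FP32 TFLOPS at boost clock, counting 2 FLOPs per shader per clock."""
    return shaders * 2 * boost_ghz / 1000.0


# Nvidia published specs (assumption: reference boost clocks).
rtx3070 = boost_tflops(5888, 1.725)
rtx3090 = boost_tflops(10496, 1.695)

# Rumoured Navi figures quoted in the thread.
navi22_tf, navi21_tf = 12.8, 22.5

ratio_nv = rtx3070 / rtx3090
ratio_amd = navi22_tf / navi21_tf

print(f"RTX 3070 / RTX 3090: {ratio_nv:.1%}")
print(f"Navi 22 / Navi 21:   {ratio_amd:.1%}")
```

    Both ratios depend on the smaller chip clocking higher than the bigger one, so they shift if real-world clocks differ from the listed maximums.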
     
    no-X likes this.
  16. 3dilettante

    3dilettante Legend Alpha

    I dimly recall that there was discussion either here or elsewhere about what possible sizes of on-die cache would allow texturing behavior to be treated more as a well-behaved working set instead of a streaming, low-temporal-locality workload in that time period or earlier. I think the values at the time for the working set were then-impossible values in the hundreds of MB.
    Perhaps with some relaxation on conditions, such as in-stack versus on-die and getting within an order of magnitude, it's at least theoretically possible.

    I may have to do more than skim this. There are some interesting questions about the non-disclosed elements of these architectures being probed, such as the number of misses from each cache level and the behavior of the cache protocol.
    I'm curious about some of the assumptions made, and how they may affect the conclusions about simulator accuracy or the calculated values for things like miss-handling capacity.
    The assumed hardware model has cache sizes that match SI, but it is CPU-like in some ways. The L1 latency is listed as 1-cycle, L2 is 10-cycle, and memory is 90.
    At GDC 2018, AMD indicated GCN's latency figures at each level were ~114/190/350 cycles, or 114/76/160 when taking additive latency into account.
    Could the loads be that different and not affect the conclusion of how accurate a given simulation is?
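
    The two sets of GCN figures describe the same data in different forms. A minimal sketch, assuming the first triple is the total latency of a request served at that level and the second is the extra cost each level adds on top of the previous one (the level names and function name are made up for illustration):

```python
from itertools import accumulate

# Cumulative latency in cycles for a request served at each level (figures from the post).
cumulative = {"L1": 114, "L2": 190, "memory": 350}


def additive_steps(totals):
    """Convert cumulative latencies into the extra cycles each level adds."""
    steps = []
    previous = 0
    for total in totals:
        steps.append(total - previous)
        previous = total
    return steps


steps = additive_steps(list(cumulative.values()))
# Summing the per-level steps back up should recover the cumulative figures.
roundtrip = list(accumulate(steps))

print(dict(zip(cumulative, steps)))
```

    Under that reading, even the L1 step alone is far above the 1-cycle L1 in the assumed hardware model.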

    I haven't thought enough about whether the IPC metric chosen could be influenced by this.

    The proposed cache management scheme seems like a combination of cache pipelining and a small victim buffer seen on some CPUs, with an additional amount of cache line swapping.
    At least in the less-flooded coherent systems for CPUs, I think some level of this behavior is going on as part of the interaction of the home nodes, caches, and memory controllers. There would be hundred-cycle periods where a significant number of cache lines would be useless, if not.
    The swap function may or may not be done. The extra data movement in at least one direction might be an additional amount of complexity and consumption, at least for CPU-like scenarios.

    I didn't see the reference for where the cache pipeline for the GPU was documented, or rather, how limited the "standard" pipelining is for the GPU L2 based on their claims. Perhaps it's possible that the pipeline needs to be less intensive due to the volume of traffic or other constraints, but the timeline of cache requests and invalidations has a cache line invalidated and idle for 100 cycles because the fetch isn't started until after the current line is fully gone, if my reading of the timing diagram is right. Never mind that if AMD's numbers are at least partly representative, the memory controller latency may be 2x worse.


    It's possible Intel found that the Haswell GPU's ROP throughput was modest enough that it could dedicate the eDRAM to texturing. I'm not clear on whether the ROPs could leverage the GPU's L3 or the main L3, which could buffer their traffic further.
    As far as Crystalwell at 7nm, I think the scaling didn't address that Crystalwell used Intel's eDRAM, which TSMC's 7nm does not have. Then there's the ballpark equivalence of foundry nodes being roughly the same density as Intel's N-1 node (current troubles aside).

    An SRAM-based example might be Zen 2's L3 cache, which I've seen estimated to be ~34mm2 for 32 MB. Granted, there are tags, reliability measures, connectivity, and bandwidth measures that could hurt density, depending on the parameters of this hypothetical 128MB GPU cache. I don't think those factors are enough to counter the lack of eDRAM or node name shenanigans.
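
    As a rough illustration of the SRAM estimate, the sketch below scales the quoted Zen 2 L3 figure linearly to 128MB. Linear scaling is an assumption, and the `overhead` factor is a hypothetical knob for tags, redundancy, and wiring, not a measured value.

```python
ZEN2_L3_MB = 32
ZEN2_L3_AREA_MM2 = 34.0  # estimate quoted in the post, not an official figure


def scaled_area_mm2(target_mb: float, overhead: float = 1.0) -> float:
    """Linearly scale the Zen 2 L3 area estimate to a target capacity.

    overhead is a hypothetical multiplier for extra tags, ECC, or wiring.
    """
    mm2_per_mb = ZEN2_L3_AREA_MM2 / ZEN2_L3_MB
    return mm2_per_mb * target_mb * overhead


area_128mb = scaled_area_mm2(128)

print(f"128MB at Zen 2 L3 density: ~{area_128mb:.0f} mm2")
```

    Any density penalty from higher bandwidth or more ports would push the figure up from there.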

    (edit: lost the word shenanigans)
     
    Last edited: Sep 26, 2020
  17. gamervivek

    gamervivek Regular

    Last edited by a moderator: Sep 26, 2020
  18. "num_rb_per_se" refers to Render Backends per Shader Engine? If it's still 4 ROPs per RBE, we're looking at 64 ROPs on Navi 21?
     
    Lightman likes this.
  19. BRiT

    BRiT (>• •)>⌐■-■ (⌐■-■) Moderator Legend Alpha

    Most likely limited by memory bandwidth anyways?
     
    Lightman, Picao84 and PSman1700 like this.
  20. trinibwoy

    trinibwoy Meh Legend

    That fillrate though. 128 rops @ 2.2.

    Maybe AMD should market their 8K gaming chops too.

    Edit: Oh it's 64 rops? Why did I think it was 128.
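
    For reference, the ROP count and fillrate arithmetic could look like this. It is a sketch under stated assumptions: 4 ROPs per RBE as suggested earlier in the thread, one pixel per ROP per clock, and hypothetical values of 4 shader engines with 4 render backends each, chosen only because they give 64 ROPs.

```python
ROPS_PER_RBE = 4  # assumption carried over from the earlier post


def total_rops(shader_engines: int, rb_per_se: int, rops_per_rbe: int = ROPS_PER_RBE) -> int:
    """Total ROPs from the per-shader-engine render backend count."""
    return shader_engines * rb_per_se * rops_per_rbe


def pixel_fillrate_gpix(rops: int, clock_ghz: float) -> float:
    """Peak pixel fillrate in Gpixels/s, assuming one pixel per ROP per clock."""
    return rops * clock_ghz


# Hypothetical configuration: 4 shader engines x 4 render backends each.
rops = total_rops(shader_engines=4, rb_per_se=4)

fill_64 = pixel_fillrate_gpix(rops, 2.2)
fill_128 = pixel_fillrate_gpix(128, 2.2)

print(f"{rops} ROPs @ 2.2GHz: {fill_64:.1f} Gpix/s")
print(f"128 ROPs @ 2.2GHz: {fill_128:.1f} Gpix/s")
```

    Whether that peak is reachable depends on memory bandwidth, as noted above.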
     
    PSman1700 likes this.