AMD: Navi Speculation, Rumours and Discussion [2019-2020]

Discussion in 'Architecture and Products' started by Kaotik, Jan 2, 2019.

Thread Status:
Not open for further replies.
  1. Esrever

    Regular

    Joined:
    Feb 6, 2013
    Messages:
    846
    Likes Received:
    647
Since the data scraped from firmware points to both HBM and GDDR6, would it be possible to connect HBM and GDDR6 in one memory system? Memory coherency has advanced a lot, and AMD has already done the SSG with NAND attached for use as VRAM. Could they make a memory system with both HBM and GDDR6 as one coherent pool?
     
  2. fellix

    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,552
    Likes Received:
    514
    Location:
    Varna, Bulgaria
Why not? Intel already did a similar thing with Knights Landing and its MCDRAM.
     
  3. Alexko

    Veteran Subscriber

    Joined:
    Aug 31, 2009
    Messages:
    4,541
    Likes Received:
    964
    If I understand things correctly, this was supposed to be in "pre-qualification" earlier this month, so it would be quite a shock to see it in an actual product this year.
     
  4. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
I'd misunderstood fan-out. This article helps:

    https://semiengineering.com/momentum-builds-for-advanced-packaging/

    So this would appear to have no substantial impact on a GPU die's active area.
     
    BRiT likes this.
  5. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    I got bored:

[image]

The analysis excludes the PHY, MC, IO and media blocks in the XSX die shot; i.e. to come up with a die size for Navi 21, you would need to add those to whatever numbers you derive from the above.

    Total GPU does include command processor, L2 (5MB) and RBEs. Navi's RBE area would be "correct" if we assume it has 64 RBEs, but if zixel rate is doubled, I dunno how much extra space that would be. Adjust for 128 RBEs if you like - if you can find them on the die (I have a good idea, but got lazy).

    Navi's 4 shader engines are going to add more area and I don't know how to account for that.

    I've not done this to support the 128MB Infinity Cache rumour, just felt like having some fun.
     
    Alexko, Pete, Newguy and 8 others like this.
  6. pTmdfx

    Regular

    Joined:
    May 27, 2014
    Messages:
    415
    Likes Received:
    379
    GPU L2 has plenty of logic though. Atomics, dirty byte tracking, compressor, etc.

    The Zen 2 L3 block (vanilla 7nm) is another way to estimate the size of a hypothetical 128MB SRAM cache. 16MB is ~17 mm^2 (tags included though). Eight of these puts the tally at ~130 mm^2. Now depending on whether you think Navi 2X will use 7+ (EUV), and whether you think it is a memory pool or is extra capacity for the L2 cache, the number could go further down.
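The scaling above can be written out explicitly. A naive sketch: the ~17 mm² per 16 MB figure is the estimate quoted in the post, and linear scaling ignores any density gain from N7+ (EUV) or from dropping the tags:

```python
# Naive linear scaling of the Zen 2 (vanilla N7) L3 block to 128 MB.
# Assumption: one 16 MB L3 block, tags included, is ~17 mm^2.
ZEN2_L3_MB = 16
ZEN2_L3_MM2 = 17.0

target_mb = 128
blocks = target_mb // ZEN2_L3_MB    # 8 blocks of 16 MB
total_mm2 = blocks * ZEN2_L3_MM2    # 136 mm^2, i.e. the ~130 mm^2 ballpark

print(blocks, total_mm2)            # 8 136.0
```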
     
    #3366 pTmdfx, Sep 28, 2020
    Last edited: Sep 28, 2020
    Pete and BRiT like this.
  7. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    pTmdfx, I've taken your hint about CPU L3 cache, so using the L3 cache that's on the same die shot:

[image]

    it turns out that this L3 is very compact. The change in size for the "128MB Infinity Cache" is crazy.

    In case you're wondering, the die image I'm using is from this slide:

[image]

    Which is from:

    https://www.eurogamer.net/articles/digitalfoundry-2020-xbox-series-x-silicon-hot-chips-analysis

    I'm still sceptical about this concept, but cache hit rate in the many rendering passes used per frame does appear to be a concern. As the rendering pass count rises and the count of target GPUs rises, it gets harder to justify to developers: "optimise your memory access patterns like this for this GPU".

    Ray tracing might be the killer app?
     
    Alexko, Pete, Lightman and 4 others like this.
  8. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    This is wrong! 4MB of L3 per CPU cluster, not 8. Sigh:

[image]
     
    Alexko, Pete, Krteq and 3 others like this.
  9. pTmdfx

    Regular

    Joined:
    May 27, 2014
    Messages:
    415
    Likes Received:
    379
    Have you included the L2 cache by accident? The numbers seem a bit too high IMO. Check the ISSCC 2020 deck for Zen 2 CCX floorplan.
     
  10. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    This is how I've defined L3:

[image]

    As I understand it, this is 4MB, and it takes up 5.4mm².
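For what it's worth, those two numbers imply a density that can be projected linearly to the rumoured capacity (a naive sketch that ignores tags, control logic and routing):

```python
# Density of the L3 region outlined above, projected linearly to 128 MB.
l3_mb = 4
l3_mm2 = 5.4
density_mm2_per_mb = l3_mm2 / l3_mb          # 1.35 mm^2 per MB

projected_mm2 = 128 * density_mm2_per_mb     # 172.8 mm^2 for 128 MB
print(density_mm2_per_mb, projected_mm2)
```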
     
  11. Frenetic Pony

    Regular

    Joined:
    Nov 12, 2011
    Messages:
    807
    Likes Received:
    478
So wait, the claim would then be 173mm² of die space for 128MB of L3. Which would need to deliver some ridiculous IPC gain somewhere, as another 256 bits of bus width would only take up, what, about 80mm²(ish?) according to the below, and that would be all that's needed to get the big one working with a standard memory configuration. AFAIK, with all the latency hiding and serialization, there's no way that would be worth the tradeoff. Even for raytracing, the slowdown should come from partially occupied wavefronts due to poor locality, even if you strip the latency down a lot. At least, I don't see how some standard L3 cache alone would be worth it.

[image]
     
    Lightman likes this.
  12. Alexko

    Veteran Subscriber

    Joined:
    Aug 31, 2009
    Messages:
    4,541
    Likes Received:
    964
    L3 has the added benefit of consuming much less energy per byte fetched, which can lead to higher clocks for a given power budget.

    Still, my money's on off-die, but on-package cache with something denser than SRAM—if there really is a very large cache, that is.
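To put rough numbers on the energy argument (all constants below are my own illustrative assumptions, in the right ballpark but not measurements or vendor data):

```python
# Back-of-envelope: energy for a frame's memory traffic with and without
# a large on-die cache. The pJ/byte constants are assumptions for
# illustration only.
PJ_PER_BYTE_DRAM = 20.0    # assumed: off-chip GDDR6 access incl. PHY/IO
PJ_PER_BYTE_SRAM = 1.5     # assumed: large on-die SRAM array

traffic_bytes = 500e6      # hypothetical per-frame memory traffic
hit_rate = 0.6             # hypothetical cache hit rate

dram_only_mj = traffic_bytes * PJ_PER_BYTE_DRAM / 1e9
cached_mj = traffic_bytes * (hit_rate * PJ_PER_BYTE_SRAM
                             + (1 - hit_rate) * PJ_PER_BYTE_DRAM) / 1e9

print(dram_only_mj, cached_mj)   # roughly 10.0 vs 4.45 mJ per frame
```

Even with these made-up numbers, most of the saving comes from the fraction of traffic that stays on-die, which is why the hit rate question in the later posts matters so much.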
     
    Lightman likes this.
  13. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    The problem we have is that the rumoured ~500mm² Navi 21 is ludicrously over-sized for a "double Navi 10".

    Using this really nice die shot (thanks for this, I was just about to go hunting for it) PHY + MCs look like they take ~64mm².

My analysis (note: L2 is tricky - there are two variants, and I think the "small block" variant, based on 4 repeated slices, is more likely correct; MB/area is similar to XSX, too):

[image]

    Even a 512-bit die has a lot of "missing" area, ~29mm², and that's with a naive doubling of uncore (GPU logic outside of shader engines) area.

    It seems that ALU utilisation suffers way more from memory latency than we would have expected (even though GPUs "hide" it). This appears to be because there are so many rendering passes in modern games, and can only be partially accounted for by the spin-up/spin-down of hardware threads:

[image]

    see slide 34:

    https://gpuopen.com/wp-content/uploads/2018/05/gdc_2018_sponsored_engine_optimization_hot_lap.pptx

    RDNA attempts to improve ALU utilisation by scheduling hardware threads for minimal duration, theoretically to maximise coherent use of memory (cache and LDS). What we don't have, as far as I can tell, is an analysis of ALU utilisation in RDNA.
     
    #3373 Jawed, Sep 30, 2020
    Last edited: Sep 30, 2020
    Alexko, Lightman, Pete and 3 others like this.
  14. SimBy

    Regular

    Joined:
    Jun 21, 2008
    Messages:
    700
    Likes Received:
    391
It's not just N21 that's ridiculously oversized. N22, with 40 CUs and a 192-bit bus, is rumoured to be 340mm².
     
  15. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    946
    Likes Received:
    413
Isn't dark silicon, there to improve thermals and reduce signal interference at higher clocks, a much simpler explanation for the large die size?
     
  16. tunafish

    Regular

    Joined:
    Aug 19, 2011
    Messages:
    627
    Likes Received:
    414
    And the most sensible thing to fill that dark silicon is cache.

    It's important to note that dark silicon will basically never mean areas of the die left literally blank. It just means you have to design your system so that not all of it can be switching at the same time. Pretty much the archetypical not-often-switching large structure is a block of cache.
     
    Alexko and Lightman like this.
  17. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
In my analysis, a 256-bit "double Navi 10" comes out at ~363mm². Wouldn't it be funny if the rumoured die sizes were all for the "next chip up in size"...
     
    LordEC911 likes this.
  18. pTmdfx

    Regular

    Joined:
    May 27, 2014
    Messages:
    415
    Likes Received:
    379
Also worth mentioning: GPU kernels have more explicit say in cache policies than typical CPU cores.

    Say RDNA/GCN allows you — for every request — to alter L0, L1 and L2 policies by choosing different combos of GLC, DLC and SLC bits.

So presumably, if the cache hierarchy is getting a big capacity boost, the shader compiler would likely get a big complementary upgrade too. One could go as far as JITing shaders with live profiling data to, say, make resource accesses with a high L2 read miss rate skip L2 for all future reads.
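A much-simplified model of those policy bits, as a sketch. The mapping (GLC roughly gating the L0, DLC the graphics L1, SLC the L2) is my rough reading of the RDNA ISA documentation, not a cycle-accurate description:

```python
# Which cache levels a load request may allocate into, given the three
# per-request policy bits. Simplified illustration only.
def cacheable_levels(glc: bool, dlc: bool, slc: bool) -> list:
    levels = []
    if not glc:
        levels.append("L0")   # GLC=1: miss/bypass at the L0
    if not dlc:
        levels.append("L1")   # DLC=1: bypass the graphics L1
    if not slc:
        levels.append("L2")   # SLC=1: stream past the L2
    return levels

# A "skip L2 for this resource" policy, as in the JIT idea above:
print(cacheable_levels(glc=False, dlc=False, slc=True))   # ['L0', 'L1']
```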

Then of course, the option of large SRAMs being a backing memory pool ("eSRAM" in Xbox One) is also on the table. Such a pool could be controlled either by an "HBCC" kind of thing (a hardware-assisted page cache, supposedly) or 100% by software (the driver). This also shifts the issue from the hardware caching realm to virtual memory management.
     
    #3378 pTmdfx, Sep 30, 2020
    Last edited: Sep 30, 2020
    Lightman likes this.
19. More GDDR6 PHYs alone don't give the chip more effective bandwidth. They'd need to be paired with more memory chips, which come at a (lately very unpredictable) cost. Plus, it seems that GDDR6 is especially picky about signaling and PCB placement, which is probably why 384-bit arrangements are now reserved for >$1400 graphics cards (nvidia had 384-bit GDDR5 cards in the $650 range).

A bigger chip with a narrower memory bus seems like a safer bet (if it's effective). IHVs can usually decrease and control the cost of those bigger chips as yields improve and wafers get cheaper, but when the time comes to renew DRAM supply contracts they have no such control over external pricing.
     
  20. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    946
    Likes Received:
    413
    Aren't caches fairly large energy consumers?

I'm trying to familiarize myself with the creative ideas that have been brought forward regarding dark silicon. This one's an interesting overview:
A landscape of the new dark silicon design regime

Spatial and temporal switching seems a straightforward idea. I assume there are severe placement-solving problems for spatially active regions, so some local optimum may still be (say) 10% larger than a globally optimal solution, which in turn is still 25% larger than an impossible ideal solution. 3D would allow further exploitation, of course.

What I find super interesting is the suggestion of more fixed-function blocks (or C-cores) to fill up the space: much easier to lay out spatially, and temporal exclusivity is almost a given, as GPUs are so far not truly super-scalar, and spatially these regions are neighbours as they share the data paths.

Because you mentioned caches, I tried to understand their energy profile (I didn't really find anything beyond a 6% D-cache and 21% I-cache energy contribution to instruction execution, which sounds like a lot, but I guess the alternative is much worse). There are these switchable cache configs:
    Switchable cache: Utilising dark silicon for application specific cache optimisations

I only read the abstract, but I find this tempting for a GPGPU, as the different workloads and utilization types certainly have different characteristics in regard to data access (especially the difference between a BVH-data request and a swizzled texture-data access).

But then, intuitively, I believe all these proposals describe ideas/solutions that are too risky and too complex. I personally believe the answer is a very simple one.
     
    Pete and Lightman like this.

  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.