AMD: RDNA 3 Speculation, Rumours and Discussion

Discussion in 'Architecture and Products' started by Jawed, Oct 28, 2020.

  1. trinibwoy

    trinibwoy Meh Legend

    Oh right, we did talk about the multi-box patent. That one is interesting because it implies Nvidia is encoding 12 BVH nodes into a single cache line and the RT core can do 12 box tests per cycle, which is pretty impressive.

    The sharding patent is reminiscent of Volta's independent thread scheduling. Did that not already make it over to Turing and Ampere?
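    For scale, a quick sketch of why 12 boxes per cache line implies some kind of compressed node encoding. The 128-byte line size and the quantized layout below are assumptions for illustration, not Nvidia's actual format:

    ```python
    CACHE_LINE = 128  # bytes; assumed cache line size

    # Uncompressed AABB: 6 x fp32 bounds = 24 bytes each.
    uncompressed = 12 * 6 * 4       # 288 bytes: 12 plain boxes don't fit one line

    # Hypothetical quantized node: a shared fp32 origin + extents header,
    # then 6 one-byte quantized bounds per child box.
    header = 6 * 4                  # 24 bytes of shared origin/extents
    payload = header + 12 * 6 * 1   # 96 bytes, leaving 32 for child pointers/flags
    print(uncompressed, payload)
    ```

    So some form of quantization is pretty much required to hit 12 boxes per line, which fits with what the patent describes.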
     
  2. trinibwoy

    trinibwoy Meh Legend

    iroboto, Lightman and PSman1700 like this.
  3. TopSpoiler

    TopSpoiler Newcomer

    Yes, independent thread scheduling is the warp execution model for Volta and its successor generations.
    Subwarp Interleaving (3rd link) is an advanced version of it. I recommend reading the research paper.
     
    pharma and trinibwoy like this.
  4. PSman1700

    PSman1700 Legend

  5. pTmdfx

    pTmdfx Regular

    This part of the pipeline has variable latency and results are written back asynchronously. Not saying RDNA 3 will or will not do it, but intersection performance can still be scaled vertically as long as the memory hierarchy can sustain it. Evidently the TMU in RDNA 1 doubled the FP16 texture filtering rate (4x that of GCN 1, if my memory serves me right), while all macro ratios (lane:TMU) stayed the same.
     
    trinibwoy likes this.
  6. pTmdfx

    pTmdfx Regular

    One can also argue that if they are happy to throw xtors at 2x scalar+vector throughput, they might also be inclined to throw xtors at hardware traversal. There is no apparent causation, so extrapolation works both directions. :razz:
     
  7. trinibwoy

    trinibwoy Meh Legend

    Seems unlikely that this has made it into hardware. The paper concludes that today's workloads won't benefit much even when running RT. Reducing latency (bigger caches) or increasing parallelism (bigger register files) may be a simpler solution to the latency problem.
     
  8. Remij

    Remij Regular

    I hate these "leakers" who are full of $#!+, and how they're constantly mentioned and referenced on somewhat reputable websites...
     
    cheapchips and xpea like this.
  9. Jawed

    Jawed Legend

    Putting four SIMDs into a single CU with two CUs in a WGP possibly stretches the crossbars (or ring bus, as you've suggested) and theoretically requires a beefier LDS, along with more L0 cache. It all multiplies.

    A payback for more SIMDs inside a CU (or WGP) is that the more extensive scheduling hardware is amortised across more compute.
     
    Lightman and iroboto like this.
  10. pTmdfx

    pTmdfx Regular

    They could also double the LDS size without doubling the bandwidth (so 4 SIMDs sharing the existing 2 128B/cycle datapaths). This can at least enable more concurrent workgroups on a CU/WGP.
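    A toy illustration of that trade-off. The 2-array, 32-bank, 4-byte-entry layout is my reading of the RDNA arrangement, and the per-workgroup LDS usage is a made-up figure:

    ```python
    ENTRY_BYTES = 4
    BANKS_PER_ARRAY = 32
    ARRAYS = 2

    def lds_capacity(entries_per_bank):
        """Total LDS bytes for the assumed 2 x 32-bank layout."""
        return ARRAYS * BANKS_PER_ARRAY * entries_per_bank * ENTRY_BYTES

    wg_lds_bytes = 16 * 1024  # hypothetical workgroup using 16 KiB of LDS

    today = lds_capacity(512) // wg_lds_bytes     # 128 KiB capacity
    doubled = lds_capacity(1024) // wg_lds_bytes  # 256 KiB, same bandwidth
    print(today, doubled)
    ```

    The LDS-limited residency doubles even though the two 128B/cycle datapaths are unchanged, which is the point: more latency hiding without more wires.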
     
  11. techuse

    techuse Veteran

    I only see >2x possible with absurd power draw or some type of multi-GPU situation akin to Crossfire.
    There is just no way it will be anywhere near 90 TFLOPS in the same vein that a 6900xt is 23.
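    The arithmetic backs that up; a quick sketch (clocks are assumed for illustration):

    ```python
    # FP32 throughput = shaders x clock x 2 (an FMA counts as two FLOPs).
    def tflops(shaders, ghz):
        return shaders * ghz * 2 / 1000

    navi21 = tflops(5120, 2.25)              # ~23 TFLOPS, roughly the 6900 XT
    shaders_for_90 = 90 * 1000 / (2.5 * 2)   # shaders needed at an assumed 2.5 GHz
    print(navi21, shaders_for_90)
    ```

    Needing ~18K shaders at those clocks is why ~90 TFLOPS looks implausible without a multi-die part or an absurd power budget.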
     
    Last edited: May 1, 2022
  12. Jawed

    Jawed Legend

    Yes, I was referring to a beefier LDS. But it sounds like you might be describing a pair of LDSs, each the size of an RDNA 2 LDS, one LDS being private to each CU.
     
  13. pTmdfx

    pTmdfx Regular

    Not private, or else the WGP would stop being a WGP. I meant the possibility of keeping the current datapath arrangement (2 arrays of 32 32B banks; one shared request/response bus per CU; i.e. 4 SIMDs sharing, up from 2 today), while increasing the size of the individual banks (1024 entries, up from 512).

    I might be wrong about my “near-far” read though. My second thought is that RDNA (2) could be a two-level setup, where each CU gets its own LDS “front end”, independent of the two LDS bank arrays.

    The “front end” handles sequencing and result buffering with also a 32-lane crossbar. Each request would be broken down by array into 1 or more conflict-free sub-requests (bound to a specific array), each of which would also have addresses & data sorted by bank order, before being sent out to the array. This way, an actual LDS bank array can have a very slim control & datapaths around the banks, while it would require a simple(r) 2x2 crossbar for both CUs to have uniform access to both arrays.

    So if it were to be scaled up for 8 SIMDs, it could be extended as a 4x4 crossbar (moving 32x4B=128B lines), with 4 such “front ends” and 4 bank arrays.

    This still might not resemble the actual thing, but is probably closer to the truth than a naive 64x64 or (128x128 for 8 SIMDs) monolithic bank-level xbar. :razz:
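    A sketch of what that "front end" sequencing might look like; this is my reading of the idea above, not a confirmed design, and bank count/width are the usual assumed figures:

    ```python
    NUM_BANKS = 32
    BANK_BYTES = 4

    def split_conflict_free(addrs):
        """Break a 32-lane request (one byte address per lane) into
        sub-requests that each touch any bank at most once, with the
        lanes in each sub-request sorted by bank order."""
        pending = list(enumerate(addrs))
        waves = []
        while pending:
            seen, wave, rest = set(), [], []
            for lane, a in pending:
                bank = (a // BANK_BYTES) % NUM_BANKS
                if bank in seen:
                    rest.append((lane, a))   # conflicting lane -> next wave
                else:
                    seen.add(bank)
                    wave.append((bank, lane, a))
            waves.append([(lane, a) for _, lane, a in sorted(wave)])
            pending = rest
        return waves

    # 32 lanes hitting 32 distinct banks -> a single sub-request:
    print(len(split_conflict_free([i * 4 for i in range(32)])))   # 1
    # All 32 lanes hitting bank 0 -> fully serialized:
    print(len(split_conflict_free([0] * 32)))                     # 32
    ```

    With requests pre-sorted like this, the bank array itself only needs slim control logic, which is the appeal of the two-level arrangement.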
     
    Last edited: May 1, 2022
  14. Jawed

    Jawed Legend

    Which is why I shake my head at the continued existence of the WGP + CU combination... Sharding texture L0, TMU and RA within a WGP seems problematic.

    Certainly as the count of client SIMDs for LDS increases, this concept of "bank-alignment coalescing" becomes more attractive. Once the GPU is past simple 1:1 mappings for LDS banks, the variable latencies involved make this coalescing more productive, so more clients = more win.
     
  15. del42sa

    del42sa Newcomer

    https://videocardz.com/newz/amd-rdn...ns-navi-31-with-up-to-12288-stream-processors

    The Navi 31 with 6 Shader Engines, 12 Shader Arrays and 48 Work Group Processors would ship with up to 12288 Stream Processors, a reduction of 20% in core count compared to the previously rumored 15360 cores. The same applies to Navi 32, which instead of 10240 cores would ship with 8192 Stream Processors. For Navi 33 this means 4096 cores instead of 5120.
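    The configuration implied by the rumor checks out arithmetically. Note the 256 SPs per WGP figure assumes RDNA 3 doubles RDNA 2's 128 per WGP, which is itself a rumor:

    ```python
    shader_engines = 6
    arrays_per_engine = 2    # 12 Shader Arrays total
    wgps_per_array = 4
    sps_per_wgp = 256        # assumed: 2x RDNA 2's 128

    wgps = shader_engines * arrays_per_engine * wgps_per_array
    total_sps = wgps * sps_per_wgp
    print(wgps, total_sps)   # matches the 48 WGP / 12288 SP figures
    ```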
     
  16. trinibwoy

    trinibwoy Meh Legend

    Do people really believe the flagship will have 3 GCDs in the first attempt? It’s much more likely to be 2. AMD likes powers of 2.

    Also there’s no way Navi 33 with 4096 processors is 440mm^2.
     
    Last edited: May 2, 2022
    Entropy likes this.
  17. Explain your train of thought. Navi21 with 5120 units is 520mm² and N6 is only a tiny shrink over N7.
     
    del42sa likes this.
  18. Entropy

    Entropy Veteran

    I’m no trinibwoy, but in comparison to Navi 21 on 7nm, the number of CUs has shrunk by 20%, the memory I/O is halved, and the cache may or may not be halved too (I hope to God not, given the memory interface). Since TSMC additionally claims that the 6nm tweak offers 18% higher logic density, the potential for a smaller die than 440mm2 seems to be there. Of course, since we don’t know exactly what RDNA3 adds that may require additional gates, or to what extent, it really is anybody’s guess at this point.
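    A rough version of that estimate in numbers. Every split and scaling factor below is a guess for illustration only:

    ```python
    navi21_area = 520.0              # mm^2 on N7

    logic = navi21_area * 0.5        # assumed: ~half the die is CU logic
    mem_io = navi21_area * 0.15      # assumed: memory PHY + I/O share
    cache = navi21_area * 0.2        # assumed: Infinity Cache share
    other = navi21_area - logic - mem_io - cache

    estimate = (logic * 0.8 / 1.18   # 20% fewer CUs, 18% denser N6 logic
                + mem_io * 0.5       # halved memory interface (I/O barely shrinks)
                + cache * 0.5        # if the cache is halved too
                + other)             # everything else carried over as-is
    print(round(estimate))           # well under 440 mm^2 on these assumptions
    ```

    Even with generous slack in those guesses there's clearly room below 440mm2, which is why the rumored figure raises eyebrows.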
     
    Last edited: May 2, 2022
    PSman1700 likes this.
  19. trinibwoy

    trinibwoy Meh Legend

    Navi 33 isn't 5nm? I read the 440mm^2 in the videocardz table as the size of the chip not the package.
     
  20. It's on N6, pretty cheap/reasonable for that die size.
    Well yes, all those cuts and then you add back in the extra RDNA3 increases. It sounds very reasonable to me.
     