AMD: RDNA 3 Speculation, Rumours and Discussion

Discussion in 'Architecture and Products' started by Jawed, Oct 28, 2020.

  1. Silent_Buddha

    Legend

    Joined:
    Mar 13, 2007
    Messages:
    19,423
    Likes Received:
    10,317
    Those are obviously the newest announcements.

    XL Games is switching from CryEngine to UE5 for ArcheAge 2.

    Arkane Studios is switching to UE5 for their next game, Redfall. Their last game, Deathloop, used their in-house Void Engine.

    GSC Game World dropped their in-house X-Ray Engine in favor of UE5 for Stalker 2.

    Those are just the ones off the top of my head, without looking anything up. There are also others that I'm not at liberty to talk about because nothing has been announced.

    Regards,
    SB
     
    Krteq and BRiT like this.
  2. arandomguy

    Regular Newcomer

    Joined:
    Jul 27, 2020
    Messages:
    256
    Likes Received:
    364
    One of the possible shifts along those lines might be EA/BioWare: from UE to Frostbite, and now back to UE.
     
  3. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    3,242
    Likes Received:
    3,405
    Out of these, only Arkane is AAA I'd say. And the Redfall studio in Austin used Unreal for Dishonored 1 and then CryEngine for Prey, so it's not really a big win for UE5 IMO; it's more like a choice of third-party tech for the next project by a studio which has always used third-party tech.
     
    PSman1700 likes this.
  4. PSman1700

    Legend

    Joined:
    Mar 22, 2019
    Messages:
    7,118
    Likes Received:
    3,092
    History repeats itself ;) I think these kinds of things are what drive people/forums; we need doom & gloom to keep things interesting.
     
  5. Silent_Buddha

    Legend

    Joined:
    Mar 13, 2007
    Messages:
    19,423
    Likes Received:
    10,317
    In the Asian gaming sphere (SEA, China), and extending into Russia, XL Games is an AAA developer with development budgets similar to those of Western and especially Japanese AAA devs.

    Regards,
    SB
     
  6. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,058
    Likes Received:
    3,116
    Location:
    New York
    Ubershaders can work for simple visibility queries for RT AO or shadows but aren't a viable general-purpose RT solution. Once you get into multi-bounce use cases or path tracing, don't you have to spill state anyway? The only difference is whether you manage it yourself or get help from the hardware.

    The Basemark RT benchmark isn’t pretty but it does some interesting things like reflections of reflections. I’m guessing they’re doing separate passes for each “layer” of reflections.
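To illustrate the guess above, here's a toy plain-Python sketch of doing "one pass per reflection layer": each pass traces the rays reflected by the previous pass, so layer N+1 holds reflections of layer N's reflections. The 1D mirror "scene" and trace() are invented stand-ins, not what Basemark actually does.

```python
def trace(ray):
    """Pretend hit test: the ray hits the nearest mirror in front of it."""
    origin, direction = ray
    mirrors = [2.0, 5.0]                  # hypothetical mirror positions
    hits = [m for m in mirrors if (m - origin) * direction > 0]
    if not hits:
        return None
    hit = min(hits, key=lambda m: abs(m - origin))
    return (hit, -direction)              # flip direction at the mirror

def layered_reflections(primary_rays, layers):
    """One full pass per reflection layer; each pass feeds the next."""
    rays = list(primary_rays)
    per_layer_hits = []
    for _ in range(layers):
        next_rays, hits = [], []
        for ray in rays:
            reflected = trace(ray)
            if reflected is not None:
                hits.append(reflected[0])
                next_rays.append(reflected)
        per_layer_hits.append(hits)
        rays = next_rays                  # layer N's output is layer N+1's input
    return per_layer_hits
```

A ray started between the two mirrors keeps ping-ponging, so every extra pass adds one more "layer" of reflection without any recursion.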
     
    PSman1700 likes this.
  7. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    Like here?:



    Now, whether AMD ping-pongs via memory for each reflection pass, who knows. But it seems unlikely in this demo.

     
  8. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,058
    Likes Received:
    3,116
    Location:
    New York
    I probably need to look at it on a bigger screen but I’m not seeing reflections of reflections in that demo.

    It’s easier to see in Basemark.

     
    PSman1700 likes this.
  9. Lurkmass

    Regular

    Joined:
    Mar 3, 2020
    Messages:
    565
    Likes Received:
    711
    I don't think you understood what I meant ...


    The reason why shader tables are potentially suboptimal from AMD's perspective is that function arguments could potentially spill into the LDS depending on the number of shaders in the table ...

    Also, if you're going to use the argument of performance to discredit AMD's implementation of RT, then you might as well do the same for every other IHV in regards to legit multi-bounce lighting solutions, because virtually no real-time applications do multi-bounce lighting without hacks like temporal accumulation or caching ...
     
    Krteq likes this.
  10. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,058
    Likes Received:
    3,116
    Location:
    New York
    Got it. So is the issue that there's no real support for callable shaders and everything is inlined? That would explain AMD's stance.

    Discredit? What I said was that ubershaders aren’t viable as a general purpose RT solution. I suppose you could loop over bounces and try to sort rays within the workgroup on each iteration but that would require shuffling a lot of data around. Also how would it work for transparencies? Ultimately you’ll need some form of recursion either by explicitly writing out state after each bounce or having the hardware manage it for you. Do you disagree with that?

    Yes multi-bounce is slow everywhere. Nvidia recommends a maximum of 2 bounces. First bounce for reflection. Second for shadowing the reflected object. Intel on the other hand seems to be very proud of their coherency sorting hardware for handling multiple bounces more efficiently.

    RDNA 3 needs to tackle 3 RT problems and hopefully does all 3.

    1. SIMD traversal is no good for incoherent ray packets e.g. random sampling for GI. (Nvidia and Intel are MIMD)
    2. Developers must manage coherency of ray casting. Hardware can’t help. (Nvidia is trying)
    3. Developers must manage coherency of shading. Hardware can’t help. (Intel is trying)
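The "explicitly writing out state after each bounce" option mentioned above can be caricatured in plain Python as a wavefront loop: surviving per-ray state gets written back to a buffer between bounces instead of living on a call stack. The 50% albedo, material ids and termination threshold are invented stand-ins, not any vendor's actual scheme.

```python
def wavefront_trace(rays, max_bounces, min_throughput=0.2):
    # Per-ray state held in a plain buffer ("memory"), not on a stack.
    state = [{"ray": r, "throughput": 1.0, "material": r % 3} for r in rays]
    for _ in range(max_bounces):
        # Sort by material so each pass shades coherent batches together.
        state.sort(key=lambda s: s["material"])
        next_state = []
        for s in state:
            t = s["throughput"] * 0.5          # pretend 50% albedo per hit
            if t >= min_throughput:            # ray still carries energy
                # The explicit "spill": write the live state out for the
                # next bounce rather than recursing into another shader.
                next_state.append({"ray": s["ray"] + 1,
                                   "throughput": t,
                                   "material": (s["material"] + 1) % 3})
        state = next_state                     # next bounce reads the buffer
    return state
```

The sort step is the software stand-in for the coherency sorting that points 1-3 say the hardware can't (yet) do for you.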
     
    TopSpoiler, PSman1700, xpea and 2 others like this.
  11. Lurkmass

    Regular

    Joined:
    Mar 3, 2020
    Messages:
    565
    Likes Received:
    711
    What does "real support" mean exactly ? "Callable shaders" is just another one of the many API abstractions out there ...

    Intel has some limited form of dynamic dispatch (bindless dispatch/shaders) that's exclusive to its ray tracing pipeline, but there's no way to exploit this ability in other pipelines (i.e. the regular graphics or compute pipeline). With AMD, callable shaders (or the shader binding table in general) are just compute shaders with big switch statements. I'll admit that it's indeed ideal to have the shaders inlined over there, yes ...
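A plain-Python caricature of that "compute shader with big switch statements" shape: every entry of a hypothetical shader binding table is baked into one function that branches on the record index. The shader names and records here are invented for illustration, not AMD's actual codegen.

```python
def shade_red(hit):    return ("red", hit * 1)
def shade_glass(hit):  return ("glass", hit * 2)
def shade_metal(hit):  return ("metal", hit * 3)

def ubershader(record_index, hit):
    # The whole table lives inside one kernel; every added entry grows
    # the kernel's combined register footprint, which on real hardware
    # is what can eventually force arguments to spill (e.g. into LDS).
    if record_index == 0:
        return shade_red(hit)
    elif record_index == 1:
        return shade_glass(hit)
    elif record_index == 2:
        return shade_metal(hit)
    raise ValueError("no such shader record")
```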

    That's entirely dependent on factors like hardware and material complexity. If we take Quake II RTX as an example of one of the few games with multi-path lighting effects, it's implemented not with recursion or loops but with a unique PSO per path (it may not use an ubershader?) or with inline RT ...

    Recursion in general is not a hard requirement for implementing multi-path lighting effects. We're very far away from applying "general purpose" RT solutions as-is in modern AAA games, with them using roughness cutoffs, simpler material shading, temporal accumulation, avoiding transparent/scattering materials altogether, etc., so it's way too early to be insistent on anything ...
     
  12. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,058
    Likes Received:
    3,116
    Location:
    New York
    Meaning it’s more than just API syntax and is actually scheduled and executed as a composable function ala Intel and also per Microsoft’s intent.

    From the DXR spec: Implementations are expected to schedule callable shaders for execution separately from the calling shader, as opposed to the code being optimally inlined with the caller.

    Right so isn’t it AMD’s choice of implementation that’s imposing additional limitations on the usage of callable shaders?

    Interesting, is that documented somewhere? So there’s no shader table and every nth bounce uses the same nth shader for every ray?

    That’s a bit of a chicken and egg problem. Current games are limited by current hardware. It shouldn’t mean that it’s ok for future hardware to double down on those limitations. I like the way Intel is tackling the problem head on. Of course they have to prove their hardware works but at least they’re being proactive about it.
     
    Lightman and PSman1700 like this.
  13. T2098

    Newcomer

    Joined:
    Jun 15, 2020
    Messages:
    55
    Likes Received:
    115
    I'm a bit late to the party here, but 1x MCD + 2x GCD + 4x stacked 3D V-cache die seems not out of the realm of possibility either, given that they've got a shipping product already with the stacked cache.
     
    Lightman likes this.
  14. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    It's an interesting idea that there's one "MCD" and it, perhaps, has all of the GDDR connections, plus PCI Express and maybe even some high-level graphics command functionality.

    Then each GCD has all of the shader engines (including ROPs and L2s) as well as the cache chiplets stacked on top of them.

    But Infinity Cache is an L3 concept, which is localised to GDDR, not to shader engine L2. So maybe it doesn't make sense to move cache chiplets away from GDDR? In which case cache chiplets would be on top of the MCD. I suppose that would help with power/heat, since we can expect that a large portion of the MCD, taken up by GDDR PHYs, is not high in power density.

    In reality, we're all kinda playing chiplet bingo.

    Digging around in twitter threads, this comes up (the first 28 minutes):

    HIR Chiplet Workshop: Architectures and Business Aspects for Heterogeneous Integration | IEEETV

    But I don't think we can take much that's concrete from that.
     
    T2098, tsa1, Krteq and 1 other person like this.
  15. pTmdfx

    Regular

    Joined:
    May 27, 2014
    Messages:
    416
    Likes Received:
    379
    In a routing-based NoC, they can be designed not to co-locate. The memory-side cache and the coherence controller it's tied to can live elsewhere, and misses can be fed back into the NoC to get routed to the right DRAM controller at the edge.

    Epyc Rome/Milan's single-socket NUMA configurability is empirical evidence of that.

    In any case, I think cache-in-bridge-chiplets still has the highest score, given its presence in patents. The reuse-the-V-Cache-die angle is not convincing because of the significant differences in design targets (Zen core clock vs fabric clock, banking, cache line size, etc).
     
    #1175 pTmdfx, Apr 22, 2022
    Last edited: Apr 22, 2022
    T2098 likes this.
  16. T2098

    Newcomer

    Joined:
    Jun 15, 2020
    Messages:
    55
    Likes Received:
    115
    I was envisioning them stacked upon the MCD for the exact same reasons; the MCD needs to be fairly large physically just for all the external I/O balls anyway, and is relatively low power density compared to the GCDs. Once they've done RDNA3 as a pipe-cleaner/proof of concept on the high-end, in my mind at least this opens them up to really leveraging some of their older nodes for the I/O.

    For an RDNA4 midrange product, I was thinking 12nm (or a very mature at that point Samsung 8nm or TSMC 7nm) MCD, exploiting the lower cost per transistor of the older nodes, and turning the larger die size into somewhat of a positive feature. All that silicon area gives you lots of room to tile up standardized L3 chiplets (or even dummy spacer silicon for product segmentation, not necessarily having to fill every L3 'pad') as desired. All comes down to how costly and well-yielding the advanced packaging processes are, I suppose.
     
  17. Granath

    Newcomer

    Joined:
    Jul 26, 2021
    Messages:
    80
    Likes Received:
    82
  18. techuse

    Veteran

    Joined:
    Feb 19, 2013
    Messages:
    1,426
    Likes Received:
    909
    6 months, typically, from tape-out to release?
     
  19. fehu

    Veteran

    Joined:
    Nov 15, 2006
    Messages:
    2,067
    Likes Received:
    992
    Location:
    Somewhere over the ocean
    Does tape-out cover the fully assembled chiplet package or the single modules?
     
  20. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    Tape-out is purely for chips (chiplets).

    With the rumours suggesting that Navi 31 consists of three distinct chiplet designs, I would say that it doesn't tell us much. If two of the three chiplet designs are shared between Navi 31 and 32, we can only ask "why not do the smaller chiplet-based GPU first?"

    Code numbers were originally supposed to be about sequencing, i.e. that 31 is designed before 32, which is designed before 33. If 33 is taken to be the simplest iteration of RDNA 3 versus RDNA 2, then you could say it's reasonable to leave 33 tape-out until last, because it will proceed with the least risk.

    We've already seen that AMD is quite happy to wait months/years to release low-end designs, so we should expect 33 to be later. Why 32 is the latest might be a risk-reward trade-off based solely on Navi 31 progress. The competitive performance of Navi 21 may have affected the ordering, such that AMD brought Navi 31 much further forward, taking more risks, leaving a gap that 33 could fill.

    I don't believe AMD will deliver 31 and 32 "on time", for what it's worth. It's clear that 5800X 3D took much longer to get to market than V-cache should have taken (Zen 2 was built for V-cache), and Navi 31/32 are way more complex with tougher thermals, tougher packaging and tougher drivers than Navi 33.
     
    DavidGraham, fehu and xpea like this.

  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.