Next Generation Hardware Speculation with a Technical Spin [post E3 2019, pre GDC 2020] [XBSX, PS5]

Discussion in 'Console Technology' started by DavidGraham, Jun 9, 2019.

Thread Status:
Not open for further replies.
  1. snc

    snc
    Newcomer

    Joined:
    Mar 6, 2013
    Messages:
    198
    Likes Received:
    97
  2. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,349
    Likes Received:
    3,884
    Location:
    Well within 3d
    The err files generally mention errors for instructions the GPUs recognize, though not always. There's usually a more explicit mention of an opcode not being supported at all, but since this is barely a skeleton of an error file, the error types are all a generic string.
    The first set of errors wouldn't be an unrecognized opcode, but a conflict with a different ISA's version of a deep-learning instruction, which appears to be an opcode from GFX908 (Arcturus?).
    s_getreg is a standard instruction, so if an error condition is being flagged, perhaps related to the specific hardware register value, it would probably get a different error string than the unhelpful one present.
    The image load instructions are also known opcodes, so if there is a problem it may have to do with the specific mode and resources flagged, which would be a different error than the instruction being missing.

    It's not definitive proof, but there seems to be a possibility that the BVH encoding is recognized in some way by that version of GFX10.
     
  3. Tkumpathenurpahl

    Tkumpathenurpahl Oil my grapes.
    Veteran Newcomer

    Joined:
    Apr 3, 2016
    Messages:
    1,446
    Likes Received:
    1,253
    True, although it's not that unthinkable for such specs to be accurate for a dev kit.

    A 14/15 TF Vega would get kicked around by a 10 TF Navi. Even the 80 CU Arcturus aspect isn't that strange when you consider Navi uses dual CUs.

    Doubling of memory used to be quite common in dev kits, so 28GB could still be indicative of the 16-24GB of memory we're likely to see in the final console.

    The only thing about that dev kit "leak" that strikes me as particularly weird is the number of CPU cores. I don't know how likely more than 8 cores would be, and I don't recall dev kit CPUs differing from retail in anything but clock speed.
     
    PSman1700 likes this.
  4. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,349
    Likes Received:
    3,884
    Location:
    Well within 3d
    What we know now about the Arcturus name is that it belongs to a specific chip, and that software changes for drivers and LLVM indicate the chip is a derivative of the Vega line that is very focused on being an MI-100 deep-learning product--to the point that it lacks a 3D command processor. That seems like a warning flag.

    That context aside, the 1 GB L4 cache seems odd as well. First, what we know of Navi, or of reasonable cache designs, leaves questions about it. Even in a CPU hierarchy, L4 caches are rare, and AMD's GPUs stop at L2. The size doesn't seem to make sense for the bandwidth it gives: bandwidth-wise it seems rather unimpressive for a GPU (Navi's L2 already delivers that much bandwidth or more).



    As for the speculation about RDNA 2 versus RDNA 1, I think the process node is a potentially weaker indicator of architecture since AMD hasn't restricted families to one process node and semicustom allows for things to hop nodes if the customer pays. What AMD has for a roadmap for its consumer or professional products also doesn't bind its semicustom products.
    Normally, I'd tend to focus on what GPUs launch nearest to the consoles, as the current gen had a representative in Bonaire half a year prior to launch.
    However, I'm not confident that a client GPU with a similar tech baseline to the consoles has to launch before them. More likely, whatever GPUs are in development at a similar stage to the consoles would be more representative, and a client GPU could potentially wrap up its final stages faster than a whole console SoC and platform.
    Random delays and AMD's product cycles might not line up like they did with Bonaire, and some things, like the front end of Hawaii more closely resembling the PS4's, show that these products don't march in lockstep.
     
  5. Nisaaru

    Regular

    Joined:
    Jan 19, 2013
    Messages:
    989
    Likes Received:
    286
    The original MS Arcturus rumour was from late January and included information that wasn't widely circulated, so either somebody well informed, with a pulse on AMD NDA docs/rumours, created it, or there is more behind it. Then the rumour was "dismissed" and people didn't really touch it for months. The first time the Arcturus name popped up, with no real information attached, was September 2018.

    But then Silenti posted this with a video link in May.

    https://forum.beyond3d.com/posts/2068150/

    which discussed AMD's Zen 3 and other product plans. When details like a configurable, changed Hyperthreading scheduler and L4 IO cache patents suddenly popped up there, it felt like a weird coincidence. Sure, who knows what's real with these YouTube hardware rumour channels, so it's possible the guy was influenced by the original rumour. But discussing it *4* months later, with more details, sounded unlikely to me. To me these looked more like two different information sources.

    What does RT really need? A large, low-latency cache to keep track of the BVH tree. Just from that perspective, such a large eDRAM cache made sense to me, at least as a design. Crystalwell was from 2013 and used 128 MB; doing a 1 GB version 7 years later should be possible, and who knows whether the size was just there to profile real scenarios.
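    As a rough sanity check on whether a BVH could fit in a cache of that size, here's a back-of-the-envelope footprint estimate (the triangle count, node size, and leaf occupancy are all illustrative assumptions, not figures from any real title):

```python
# Back-of-the-envelope BVH footprint estimate.
# All inputs are illustrative assumptions, not measurements from any game.

def bvh_footprint_bytes(num_triangles, node_bytes=64, tris_per_leaf=2):
    """Estimate total size of a binary BVH.

    A binary tree over L leaves has about 2*L - 1 nodes.
    node_bytes=64 assumes one cache line per node (AABB plus child links).
    """
    leaves = max(1, num_triangles // tris_per_leaf)
    nodes = 2 * leaves - 1
    return nodes * node_bytes

# A hypothetical 10M-triangle scene:
size = bvh_footprint_bytes(10_000_000)
print(f"{size / 2**30:.2f} GiB")  # roughly 0.6 GiB under these assumptions
```

    Under these (generous) assumptions a heavy scene lands in the hundreds of megabytes, so a 1 GB figure is at least the right order of magnitude.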
     
  6. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    9,915
    Likes Received:
    9,281
    Location:
    Self Imposed Work Exile: The North
    We know so far that Nvidia will set aside 1 GB of VRAM for it.
    The remainder is likely some form of ALU to perform the intersections, and I'd guess cache as well.
     
  7. JoeJ

    Regular Newcomer

    Joined:
    Apr 1, 2018
    Messages:
    943
    Likes Received:
    1,060
    Where do you get this number from?
    I would assume the BVH can be any size depending on the scene, including larger than 1 GB? Likely 1 GB is some kind of average from current games?

    Besides the BVH, it would be interesting to know whether they use a stack per ray (likely) and where they store it.
    The only option I see is L1 (which they also use for LDS / texture cache), likely reserving a fixed, small size per ray; if this becomes too small, the traversal algorithm needs some restart fallback to still give correct results.
    I guess this will be very similar for AMD; maybe the TMU patent even mentioned LDS. (They also added some short-stack traversal variants to Radeon Rays.)

    Also interesting: does NV use a binary tree or an MBVH?
    AMD's TMU patent mentioned a branching factor of 4, so the tree would have far fewer levels than a binary one, and memory access can be made more coherent.
    (Still, I wonder why not 8, since that's a nice number for dividing 3D space like an octree, while 4 would suit 2D like a quadtree - there must be technical reasons.)

    However, these are just details. I do not expect a big difference between NV / AMD or desktop / console from the programmer's perspective.
    The only exception is AMD's option for programmable traversal, enabling things like stochastic LOD, or falling back to reflection probes if a ray takes too long.
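    The depth win from a wider branching factor is easy to quantify with a quick sketch (the leaf count is an arbitrary example):

```python
import math

def bvh_levels(num_leaves, branching_factor):
    """Levels needed so that branching_factor ** levels >= num_leaves."""
    return math.ceil(math.log(num_leaves, branching_factor))

leaves = 1_000_000
for b in (2, 4, 8):
    print(f"branching {b}: {bvh_levels(leaves, b)} levels")
# branching 2: 20 levels
# branching 4: 10 levels
# branching 8: 7 levels
```

    A BVH4 halves the depth of a binary tree (log4 n = log2 n / 2), and a BVH8 cuts it to a third, at the cost of testing more children per node visit.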
     
  8. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    9,915
    Likes Received:
    9,281
    Location:
    Self Imposed Work Exile: The North
    I forgot who did the testing, but someone kept checking VRAM values with RTX on and RTX off, and RTX on always set aside 1 GB of VRAM.
    I don't have more information; it was perhaps one of the first DF tests on BFV, and I believe it may have been @Dictator who ran it. I don't know if this holds for one title or all titles. It hasn't been brought up since the first reviews of RTX titles. Might be a good time to check again.
     
    pharma, JoeJ and London-boy like this.
  9. bbot

    Regular

    Joined:
    Apr 20, 2002
    Messages:
    750
    Likes Received:
    13
    Official confirmation of "future RDNA architectures" <grin>

     
    DavidGraham, chris1515 and Scott_Arm like this.
  10. TheAlSpark

    TheAlSpark Moderator
    Moderator Legend

    Joined:
    Feb 29, 2004
    Messages:
    21,443
    Likes Received:
    6,910
    Location:
    ಠ_ಠ
    I was increasingly concerned about DXR just being a low-level construct for nVidia GPUs.
     
    BRiT likes this.
  11. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,349
    Likes Received:
    3,884
    Location:
    Well within 3d
    There was a recent slide from AMD that indicates no change in the SMT setup for Zen 3, and no mention of an L4 cache.

    The on-die latencies for AMD's GPUs are very significant: on the order of ~100 cycles for an L1 hit and ~200 for a hit in the L2 for GCN. AMD has touted something like a 10% improvement, in some unspecified way, for Navi.
    I think it would be difficult for an L4 in this instance to change the overall picture, since its latency would be additive to the L2 and, presumably, the L3 that would need to exist between the L2 and L4.
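    A first-order average-memory-access-time model illustrates why latency that stacks across extra levels is hard to win back (the L1/L2 figures follow the ~100/~200-cycle GCN numbers above; the L3, L4, and DRAM latencies and all hit rates are made-up placeholders):

```python
# First-order average memory access time (AMAT) with stacked cache levels.
# L1/L2 latencies follow the ~100/~200-cycle GCN figures; the L3, L4, and
# DRAM latencies and every hit rate here are made-up placeholders.

def amat(levels):
    """levels: list of (hit_latency_cycles, hit_rate); last level must hit."""
    total, p_reach = 0.0, 1.0
    for latency, hit_rate in levels:
        total += p_reach * hit_rate * latency
        p_reach *= (1.0 - hit_rate)
    return total

# Without L4: L1 -> L2 -> DRAM.
base = amat([(100, 0.8), (200, 0.7), (400, 1.0)])
# With hypothetical L3 and L4 layered in; DRAM costs a bit more
# because the request now traverses the extra levels first.
deep = amat([(100, 0.8), (200, 0.7), (260, 0.5), (320, 0.5), (440, 1.0)])
print(round(base), round(deep))  # 132 vs 127: the extra levels barely move it
```

    With most accesses already absorbed by L1/L2, the deep hierarchy only improves the average by a few percent in this toy model, which matches the "hard to change the overall picture" intuition.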
     
    psolord, Pixel and TheAlSpark like this.
  12. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    42,985
    Likes Received:
    15,120
    Location:
    Under my bridge
    If the cache utilisation is dire in random ray tracing and you're constantly hitting main-memory, a fat cache would make sense. It depends on how well the existing caches can cope with the memory-access patterns of ray-tracing, and whether the tracing can be performed in a way that suits the caches.
     
    SpeedyGonzales and TheAlSpark like this.
  13. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,349
    Likes Received:
    3,884
    Location:
    Well within 3d
    If we're focusing on the portion of ray tracing accelerated by BVH hardware, it's potentially better behaved if there's some level of coherence.
    This isn't addressing the BVH building process, although Nvidia seems comfortable with shader hardware performing a lot of the build work for the low-level structure.

    Traversal goes through a read-only structure (which helps avoid the read/write turnaround penalties DRAM hates), usually formatted to pack data into cache-line-friendly nodes, and a traversal method that favors depth can provide opportunities for cache-line reuse, or some locality if there are concurrent ray traversals on a similar path nearby (in terms of location in the BVH and in time).
    A likely reason for the RT cores is that they seem to accelerate a version of a stack-based algorithm, which allows for structures that tend to be more compact in memory, and the set of intermediate nodes it might return to can cache better.

    Whether this makes it good, rather than just better than some alternatives, hasn't been clearly tested in the analyses I've seen.
    Early on, there were signs that RT was not bogging down in terms of bandwidth, but that there was heavier synchronization or contention, which may have been improved since then.
    Traversal would involve more round trips to memory as the hardware hops from node to node, though even then an L4 cache sitting behind 3 or 4 other cache levels in a GPU seems unlikely to bring much improvement. AMD's L2 is already at 50-60% of DRAM latency, so two more layers of cache in a GPU hierarchy seem risky in terms of getting better latency.
    Even if it did, a cache of that size is a big investment for one feature, and keeping it in the hierarchy while RT is thrashing caches means all prior levels get thrashed before requests reach an L4 whose benefits are arguable.
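    To make the cache-line-friendly packing point concrete, here's a sketch of a 4-wide node layout (the field choices are illustrative, not any vendor's actual format):

```python
import struct

# Sketch of a 4-wide BVH node packed for cache-line-sized fetches.
# Layout is illustrative, not any vendor's real format: four child AABBs
# stored as 6 floats each (min/max per axis), plus four child indices.
NODE_FMT = "<" + "6f" * 4 + "4i"   # 4 * 24 B of bounds + 16 B of links

node_size = struct.calcsize(NODE_FMT)
print(node_size)  # 112 bytes: each node spans two 64-byte cache lines
```

    Fetching one node pulls in the bounds of all four children at once, so a single memory transaction feeds four intersection tests, which is part of why a branching factor above 2 makes access more coherent.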
     
    MBTP, DavidGraham, pharma and 8 others like this.
  14. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,580
    Likes Received:
    980
    L1 and L2 cache latency can have an impact on ray tracing. The top-level nodes of your global space-decomposition structure live in your caches, because they are hit all the time. The bottom levels almost always miss the caches and go to main memory (ray coherency notwithstanding).

    On a CPU, a first-order approximation of cost is to consider traversing the cached top levels free, but if the latency is as high as detailed above, that's certainly not the case for GCN.
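    That first-order approximation can be written down directly (the depth split and latency numbers are placeholder assumptions):

```python
# First-order ray-traversal cost: cached top levels are cheap, deeper
# levels pay a miss to memory. All numbers are placeholder assumptions.

def traversal_cost_cycles(depth, cached_levels, cached_latency, miss_latency):
    """Cost of walking one root-to-leaf path of `depth` nodes.

    The top `cached_levels` nodes hit cache at `cached_latency` cycles each;
    every deeper node is assumed to miss to memory at `miss_latency` cycles.
    """
    cached = min(depth, cached_levels)
    return cached * cached_latency + (depth - cached) * miss_latency

# CPU-like: the cached top of the tree is nearly free next to DRAM misses.
print(traversal_cost_cycles(20, 10, 4, 300))    # 3040: misses dominate
# GCN-like, per the latencies above: even cache hits cost ~100 cycles.
print(traversal_cost_cycles(20, 10, 100, 500))  # 6000: hits are not free
```

    On the CPU-like numbers the cached half of the walk contributes about 1% of the cost; on the GCN-like numbers it contributes a sixth, so treating it as free stops being a good approximation.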

    Cheers
     
  15. bbot

    Regular

    Joined:
    Apr 20, 2002
    Messages:
    750
    Likes Received:
    13
    If the Project Scarlett die is 350 mm² on 7nm, someone estimated 40 CUs. But if you assume 7nm+ instead, what would the CU count be?
     
  16. fehu

    Veteran Regular

    Joined:
    Nov 15, 2006
    Messages:
    1,648
    Likes Received:
    617
    Location:
    Somewhere over the ocean
    40CU+
     
    psolord, DSoup, Rootax and 7 others like this.
  17. Pinstripe

    Newcomer

    Joined:
    Feb 24, 2013
    Messages:
    109
    Likes Received:
    55
    Probably the same, because you need to account for more die area for RT hardware, whether that's larger CUs or fixed-function hardware.
     
  18. Proelite

    Veteran Regular Subscriber

    Joined:
    Jul 3, 2006
    Messages:
    1,417
    Likes Received:
    744
    Location:
    Redmond
    52CU (48 enabled) on 7nm, 64CU (56 enabled) on 7nm+.
     
  19. McHuj

    Veteran Regular Subscriber

    Joined:
    Jul 1, 2005
    Messages:
    1,551
    Likes Received:
    735
    Location:
    Texas
    I think this is absolutely correct. Going from RDNA to RDNA 2 (or 1.5), if anything the CUs will be increasing in size, be it from added functionality or from even larger caches to push bandwidth further, especially for RT.

    36 to 40 active CUs is probably the right target.
     
  20. PSman1700

    Veteran Newcomer

    Joined:
    Mar 22, 2019
    Messages:
    2,344
    Likes Received:
    725
    AMD already confirmed 7nm, this isn't the baseless section either :)
     