Next Generation Hardware Speculation with a Technical Spin [post E3 2019, pre GDC 2020] [XBSX, PS5]

Discussion in 'Console Technology' started by DavidGraham, Jun 9, 2019.

  1. MrFox

    MrFox Deludedly Fantastic
    Legend Veteran

    Joined:
    Jan 7, 2012
    Messages:
    6,488
    Likes Received:
    5,996
I'm curious how VRS will impact unstable frame rates. If the screen is filled with fine foliage, there's no area of the screen to rate down, so does VRS do nothing for some framings and work great for others?
     
    egoless likes this.
  2. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    43,576
    Likes Received:
    16,033
    Location:
    Under my bridge
You could probably force it, compression style (maximum bitrate), to apply enough to get the performance gains needed for a given framerate. You'd run a pass to find the low-detail areas where VRS is a good fit, and if there's not enough low-detail area, change the threshold until there is, thus getting the equivalent of macroblocking with shader detail. Maybe a big explosion blurs out the detail a bit, and on dense foliage the foliage detail is just reduced, then increased on the same assets when there are fewer of them. You could also just up VRS in the periphery, keeping everything centre-screen sharp and reducing detail towards the edges, foveated-rendering style.
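    A minimal CPU-side sketch of that "maximum bitrate" idea (all names and numbers here are made up; a real engine would feed the resulting threshold into whatever pass builds its VRS rate image):

    Code:
    #include <algorithm>

    // Hypothetical feedback controller: tighten or relax a coarse-shading
    // threshold each frame until the measured GPU time fits the budget,
    // much like a video encoder chasing a target bitrate.
    struct VrsController {
        float threshold = 0.25f;   // tiles with a detail metric above this keep full rate
        float budgetMs  = 16.6f;   // target frame time
        float step      = 0.05f;

        void update(float gpuMs) { // call once per frame with the measured GPU time
            if (gpuMs > budgetMs)
                threshold -= step; // too slow: coarse-shade more of the screen
            else if (gpuMs < budgetMs * 0.9f)
                threshold += step; // headroom: restore detail
            threshold = std::clamp(threshold, 0.0f, 1.0f);
        }
    };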
     
    turkey likes this.
  3. Proelite

    Veteran Regular Subscriber

    Joined:
    Jul 3, 2006
    Messages:
    1,490
    Likes Received:
    877
    Location:
    Redmond
  4. MrFox

    MrFox Deludedly Fantastic
    Legend Veteran

    Joined:
    Jan 7, 2012
    Messages:
    6,488
    Likes Received:
    5,996
Okay, so we might get some really good dynamic VRS replacing the coarse method of dynamic resolution to get a stable frame rate.
     
    Scott_Arm, pharma and milk like this.
  5. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    43,576
    Likes Received:
    16,033
    Location:
    Under my bridge
    Yep. Geometry should hopefully remain 'native' rendering resolution, with VRS adjusting the details thereof.
     
  6. JoeJ

    Veteran Newcomer

    Joined:
    Apr 1, 2018
    Messages:
    1,007
    Likes Received:
    1,157
    :O
    Super hot stuff!
    Give it to me, now!!!
    <3 <3 <3
     
  7. X-AleX

    Newcomer

    Joined:
    May 20, 2005
    Messages:
    75
    Likes Received:
    14
  8. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    12,365
    Likes Received:
    14,121
    Location:
    The North
Summary? I didn't get anything from the abstract.
     
  9. Globalisateur

    Globalisateur Globby
    Veteran Regular Subscriber

    Joined:
    Nov 6, 2013
    Messages:
    3,795
    Likes Received:
    2,669
    Location:
    France
    Or a hybrid solution mixing both methods.
     
  10. TheAlSpark

    TheAlSpark Moderator
    Moderator Legend

    Joined:
    Feb 29, 2004
    Messages:
    22,057
    Likes Received:
    8,265
    Location:
    ಠ_ಠ
    Compute tunnelling?
     
  11. anexanhume

    Veteran Regular

    Joined:
    Dec 5, 2011
    Messages:
    2,074
    Likes Received:
    1,528
Could also go the foveated approach and apply the most LOD at the center of the screen, near the cursor, perhaps rated by the amount of motion in a particular frame. The latter would be similar to AMD's "Radeon Boost" feature: https://www.pcgamesn.com/amd/radeon-boost-performance-benchmarks
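    As a sketch of what picking the per-tile rate could look like (the weights and thresholds are pure guesses, and the three tiers just mirror the 1x1 / 2x2 / 4x4 rates VRS hardware exposes):

    Code:
    #include <cmath>

    enum class ShadingRate { Rate1x1, Rate2x2, Rate4x4 };

    // Hypothetical: coarser shading the further a tile sits from the focus
    // point (cursor or screen centre) and the faster it is moving.
    ShadingRate pickRate(float tileX, float tileY,    // tile centre, pixels
                         float focusX, float focusY,  // cursor / screen centre
                         float motionPx)              // per-tile motion this frame
    {
        float dist  = std::hypot(tileX - focusX, tileY - focusY);
        float score = dist / 1000.0f + motionPx / 32.0f; // made-up weights
        if (score < 0.5f) return ShadingRate::Rate1x1;   // sharp near the focus
        if (score < 1.0f) return ShadingRate::Rate2x2;
        return ShadingRate::Rate4x4;                     // periphery or fast motion
    }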
     
  12. JoeJ

    Veteran Newcomer

    Joined:
    Apr 1, 2018
    Messages:
    1,007
    Likes Received:
    1,157
    Umm... after trying to decipher some sense out of it, I want to forward the question to the experts here.

    In the initial motivation they talk about small workloads, where task allocation and scheduling costs cause inefficiency. That's a main problem for me, so I got excited.

    Now I think it maybe describes a low-latency way for the CPU to communicate with the GPU, perhaps bypassing something like a clumsy API and its command lists.
    But I'm unsure if the purpose here is to feed the large GPU with compacted workloads built from smaller tasks,
    or if this GPU coprocessor is just there to process unique tasks that are parallel-friendly, but too small in number to make sense for a wide GPU.

    I'm left in total confusion.
    What I want is basically NV's task shaders, but dispatching compute with variable workgroup width, and keeping some data flow on chip if possible.
    Seems this is something different, but it still sounds interesting...
     
  13. Silent_Buddha

    Legend

    Joined:
    Mar 13, 2007
    Messages:
    17,426
    Likes Received:
    7,137
    Interesting, there's quite a variety of things in there. Various VRS patents, a game streaming related patent (latency), RT related patents, graphics development related patents, a GPU related patent.

    Interesting stuff, but I'll leave it for someone more technical to try to determine what might or might not be applicable to consoles.

    Regards,
    SB
     
  14. anexanhume

    Veteran Regular

    Joined:
    Dec 5, 2011
    Messages:
    2,074
    Likes Received:
    1,528
    I would say that the patents co-authored with Mark S. Grossman are a safe bet as intended for consoles. He's the Xbox chief GPU architect.
     
    BRiT and Silent_Buddha like this.
  15. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    12,365
    Likes Received:
    14,121
    Location:
    The North
    It Just Works™ ?
     
  16. TheAlSpark

    TheAlSpark Moderator
    Moderator Legend

    Joined:
    Feb 29, 2004
    Messages:
    22,057
    Likes Received:
    8,265
    Location:
    ಠ_ಠ
  17. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,517
    Likes Received:
    4,572
    Location:
    Well within 3d
    It's been some time since then, so I've probably forgotten many things, but which benchmarks or metrics had a 5x lead? There were some specific use cases like double-precision that I can remember, although that would understandably be of little concern outside of compute like HPC--where AMD's lack of a software foundation negated even leads like that.


    This came up in the pre-E3 thread.
    https://forum.beyond3d.com/posts/2067755/

    I speculated on a few elements of the patent here:
    https://forum.beyond3d.com/posts/2069676/

    One embodiment is a CU with most of the SIMD resources stripped from the diagram, and other elements like the LDS and export bus removed.
    From GCN, it's a loss of 3/4 of the SIMD schedulers, while from Navi it's a loss of 1/2. SIMD-width isn't touched on much, although one passage discusses a 32-thread scenario.

    Beyond these changes, the CU is physically organized differently, and its workload is handled differently.
    The SIMD is in one embodiment arranged like a dual-issue unit, and there is a tiered register file with a larger 1-read, 1-write file and a smaller multi-ported fast register file. There is a register-access unit that can be used to load different rows from each register bank, and a crossbar that can rearrange values from the register file or the outputs of the VALUs. Possibly the removal of the LDS did not remove the hardware involved in handling more arbitrary access of the banked structure, and it was repurposed and expanded upon for this. Efficient matrix transpose operations were noted as a use case for these two rather significant additions to the access hardware.
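    For context on why a fast transpose path is worth dedicated hardware: the conventional route today is to bounce a tile through the LDS (shared memory, in CUDA terms), as in this standard sketch. The patent's register-file crossbar would do the equivalent rearrangement between register banks without that round trip.

    Code:
    #define TILE 32

    // Standard tiled transpose, assuming a (TILE, TILE) thread block: stage
    // a tile in shared memory so both the global read and write stay coalesced.
    __global__ void transpose(const float* in, float* out, int width, int height)
    {
        __shared__ float tile[TILE][TILE + 1];     // +1 pad avoids bank conflicts

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        if (x < width && y < height)
            tile[threadIdx.y][threadIdx.x] = in[y * width + x];
        __syncthreads();

        x = blockIdx.y * TILE + threadIdx.x;       // transposed block coordinates
        y = blockIdx.x * TILE + threadIdx.y;
        if (x < height && y < width)
            out[y * height + x] = tile[threadIdx.x][threadIdx.y];
    }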

    The workload handling is also notably changed. The scalar path is heavily leveraged to run a persistent thread, which unlike current kernels is expected to run continuously between uses. The persistent kernel monitors a message queue for commands, which it then matches in a lookup table with whatever sequence of instructions needs to be run for a task.
    The standard path on a current GPU would involve command packets going to a command processor, which then hands off to the dispatch pipeline, which then needs to arbitrate for resources on a CU, which then needs to be initialized, and then the kernel can start. Completion and a return signal are handled indirectly, partly involving the export/message path and possibly a message/interrupt engine. Subsequent kernels or system requests would need to go through this process each time.

    The new path has at least an initial startup, but only for the persistent thread. Once it is running, messages written to its queue can skip past all the hand-offs and into the initial instructions of the task. Its generation of messages might also be more direct than the current way CUs communicate to the rest of the system.
    This overall kernel has full access to all VGPRs, so it's at least partly in charge of keeping the individual task contexts separate and needs to handle more of the startup and cleanup that might be handled automatically in current hardware. There's some concurrency between tasks, but from the looks of things it's not going to have as many tasks as a full SIMD would. The scalar path may also see more cycles taken up by the persistent kernel rather than direct computation.
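    To make the contrast concrete, here's a toy CUDA analogue of that persistent-thread path; everything in it is my own stand-in for the patent's mechanism (thread 0 plays the scalar monitor, a host-visible queue slot plays the message queue, and the if-chain plays the lookup table). The point is that once the kernel is resident, a new task costs a queue write rather than a full dispatch.

    Code:
    enum Cmd : int { CMD_NONE, CMD_SCALE, CMD_OFFSET, CMD_EXIT };

    struct Msg { volatile int cmd; float arg; };   // host-visible queue slot

    // Launched once, e.g. <<<1, 256>>>, with Msg in mapped host memory
    // (cudaHostAlloc with cudaHostAllocMapped). Ordering/fencing is
    // simplified here; real code needs more care.
    __global__ void persistentKernel(Msg* q, float* data, int n)
    {
        __shared__ int   cmd;
        __shared__ float arg;
        for (;;) {
            if (threadIdx.x == 0) {
                while (q->cmd == CMD_NONE) { }     // spin on the message queue
                cmd = q->cmd;
                arg = q->arg;
                q->cmd = CMD_NONE;                 // acknowledge to the host
            }
            __syncthreads();
            if (cmd == CMD_EXIT) return;
            for (int i = threadIdx.x; i < n; i += blockDim.x) {
                if (cmd == CMD_SCALE)  data[i] *= arg;  // task body starts with
                if (cmd == CMD_OFFSET) data[i] += arg;  // no dispatch overhead
            }
            __syncthreads();                       // finish before next command
        }
    }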

    There was one possible area of overlap with GFX10 when the patent mentioned shared VGPRs, but this was before the announcements of sub-wave execution, which has a different sort of shared VGPR.
    Other than a brief mention of its possibly being narrower than GCN, it's substantially different from GCN and RDNA.

    Use cases include packet processing, image recognition, cryptography, and audio. These are cited as workloads that are more latency-sensitive and whose compute and kernel state doesn't change that much.
    Sony engineering has commented in the past that audio processing on the PS4 GPU was very limited due to its long latency, and AMD has developed multiple projects for handling this better, be it TrueAudio, high-priority queues, TrueAudio Next, or the priority tunneling for Navi. This method might be more responsive.
    Perhaps something like cryptography might make sense for the new consoles with their much faster storage subsystems, which I would presume to be compressed and encrypted to a significant degree. Not sure GPU hardware would beat dedicated silicon for just that one task.

    Other elements, like image recognition and packet processing, might come up in specific client use cases, but I would wonder if this could be useful in HPC as well.
    The fast transpose capability is something that might benefit one idea put forward for ray-tracing on an AMD-like GPU architecture (can pack/unpack ray contexts to better work around divergence), although in this instance it would be less integrated than AMD's TMU-based ray-tracing or even Nvidia's RT cores, since this new kind of CU would be much more separate and it may lack portions of standard capability. It's not clear whether such a CU or its task programs would be exposed the same way, as there are various points where an API or microcode could be used rather than direct coding.
     
  18. JoeJ

    Veteran Newcomer

    Joined:
    Apr 1, 2018
    Messages:
    1,007
    Likes Received:
    1,157
    Thanks for clearing up the patent, I knew you would :) Your insights are quite priceless!

    The benchmark is my own work on realtime GI. The workloads are breadth-first traversals of a BVH / point hierarchy, raytracing for visibility, and building acceleration structures. But it's not comparable to classic raytracing. Complexity is much higher and random access is mostly avoided. The general structure of the programs is: load from memory, heavy processing using LDS, write to memory. Rarely do I access memory during the processing phase, and there is a lot of integer math, scan algorithms, and also a lot of bit packing to reduce LDS. Occupancy is good, overall 70-80%. It's compute only - no rasterization or texture sampling.

    The large AMD lead was constant over many years and APIs (OpenGL, OpenCL 1.2, finally Vulkan). The factor of 5 I remember from the latest test in Vulkan comparing GTX 670 vs. 7950, two years ago.
    Many years ago I bought a 280X to see how 'crappy' AMD performs with my stuff, and I could not believe it destroyed the Kepler Titan by a factor of two out of the box.
    At this time I also switched from OpenGL to OpenCL, which helped a lot with NV performance but only a little with AMD. I concluded neither AMD's hardware nor their drivers are 'crappy' :)
    Adding this to the disappointment of the GTX 670 not being faster than the GTX 480, I skipped the following NV generations. I also rewrote my algorithm, which I did on CPU.
    Years later, after porting the results back to GPU (using OpenCL and Vulkan), I saw that after the heavy changes the performance difference was the same. Rarely does a shader (I have 30-40) show an exception.
    I also compared newer hardware: Fury X vs. GTX 1070. And thankfully it showed NV did well. Both cards have the same performance per TF, just AMD offers more TF per dollar. So until I get my hands on Turing and RDNA I don't know how things have changed further.

    Recently I learned Kepler has no atomics to LDS, and emulates them with main memory. That's certainly a factor, but it can't be that large - I always tried things like comparing a scan algorithm vs. atomic max and picking the faster one per GPU model.
    So it remains a mystery why Kepler is so bad.
    If you have an idea let me know, but it's too late - seems the 670 died recently :/
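    For what it's worth, the two variants being compared look roughly like this in CUDA terms (a generic sketch, not my actual shaders):

    Code:
    #include <climits>

    // Variant A: block-wide max via an atomic on shared memory / LDS.
    // Fast where shared-memory atomics are native; per the above,
    // emulated through main memory on Kepler.
    __global__ void blockMaxAtomic(const int* in, int* out, int n)
    {
        __shared__ int smax;
        if (threadIdx.x == 0) smax = INT_MIN;
        __syncthreads();
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) atomicMax(&smax, in[i]);
        __syncthreads();
        if (threadIdx.x == 0) out[blockIdx.x] = smax;
    }

    // Variant B: the same result via a shared-memory tree reduction,
    // with no atomics at all (blockDim.x must be a power of two).
    __global__ void blockMaxReduce(const int* in, int* out, int n)
    {
        extern __shared__ int s[];   // launched with blockDim.x * sizeof(int)
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        s[threadIdx.x] = (i < n) ? in[i] : INT_MIN;
        __syncthreads();
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (threadIdx.x < stride)
                s[threadIdx.x] = max(s[threadIdx.x], s[threadIdx.x + stride]);
            __syncthreads();
        }
        if (threadIdx.x == 0) out[blockIdx.x] = s[0];
    }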

    One interesting thing is AMD benefits much more from optimization, and I tried really hard here because GI is quite heavy.
    Also, NV seems much more forgiving of random access, and maybe I'm an exception here compared to other compute benchmark workloads.
     
    Rootax likes this.
  19. anexanhume

    Veteran Regular

    Joined:
    Dec 5, 2011
    Messages:
    2,074
    Likes Received:
    1,528
    AMD has been looking at this for a while. This is a paper from 2014 in which they propose modifying the ALUs for only a 4-8% area increase. They propose 4 traversal units per CU. This is a 1-for-1 match to the number of TMUs, which is exactly where the hardware resides in more recent AMD patents on ray tracing.

    https://pdfs.semanticscholar.org/26ef/909381d93060f626231fe7560a5636a947cd.pdf

    These are proposed changes to Hawaii (R9 290X). With Navi's enhanced caches, I would think it's already more suitable for the modifications described.

    Here's their latest patent:

    http://www.freepatentsonline.com/20190197761.pdf

     
    #2419 anexanhume, Jan 2, 2020
    Last edited: Jan 2, 2020
    BRiT, ToTTenTranz, JoeJ and 1 other person like this.
  20. JoeJ

    Veteran Newcomer

    Joined:
    Apr 1, 2018
    Messages:
    1,007
    Likes Received:
    1,157
    Reminds me of this paper, which also took this architecture as an example: https://pdfs.semanticscholar.org/26ef/909381d93060f626231fe7560a5636a947cd.pdf
    I don't know if the work is directly related to AMD.

    Yeah, they have tons of RT experience, and more research / software experience than AMD in general.
    This led me to the initial assumption that RTX must be very advanced, including reordering to improve both ray and shading coherence, and that AMD could never catch up.
    But without all this advanced stuff, all that remains is simple tree traversal and triangle intersection, which is all NV's RT cores are doing, and there is no other form of hardware acceleration. As far as we know.
    This very likely means AMD can catch up easily. It also means RT does not waste too much chip area just for that, so it makes more sense to begin with anyway.
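    For reference, that "simple" fixed-function job is roughly the following loop; a generic, self-contained sketch (node layout, intersection routines, and constants are all illustrative, not NV's actual hardware):

    Code:
    #include <algorithm>
    #include <cmath>

    struct Vec3 { float x, y, z; };
    static Vec3  sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
    static float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
    static Vec3  cross(Vec3 a, Vec3 b) {
        return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
    }

    struct Node {                 // illustrative flattened BVH node
        Vec3 bmin, bmax;          // AABB
        int  left;                // first child index, or -1 for a leaf
        int  firstTri, triCount;  // leaf payload
    };

    // Slab test against a node's AABB.
    static bool rayBoxHit(Vec3 o, Vec3 invD, Vec3 lo, Vec3 hi, float tMax) {
        float t1 = (lo.x - o.x) * invD.x, t2 = (hi.x - o.x) * invD.x;
        float tmin = std::min(t1, t2), tmax = std::max(t1, t2);
        t1 = (lo.y - o.y) * invD.y; t2 = (hi.y - o.y) * invD.y;
        tmin = std::max(tmin, std::min(t1, t2)); tmax = std::min(tmax, std::max(t1, t2));
        t1 = (lo.z - o.z) * invD.z; t2 = (hi.z - o.z) * invD.z;
        tmin = std::max(tmin, std::min(t1, t2)); tmax = std::min(tmax, std::max(t1, t2));
        return tmax >= std::max(tmin, 0.0f) && tmin < tMax;
    }

    // Moller-Trumbore ray/triangle test; returns hit distance, or tMax on miss.
    static float rayTriHit(Vec3 o, Vec3 d, const Vec3* tri, float tMax) {
        Vec3 e1 = sub(tri[1], tri[0]), e2 = sub(tri[2], tri[0]);
        Vec3 p = cross(d, e2);
        float det = dot(e1, p);
        if (std::fabs(det) < 1e-8f) return tMax;
        float inv = 1.0f / det;
        Vec3 tv = sub(o, tri[0]);
        float u = dot(tv, p) * inv;
        if (u < 0.0f || u > 1.0f) return tMax;
        Vec3 q = cross(tv, e1);
        float v = dot(d, q) * inv;
        if (v < 0.0f || u + v > 1.0f) return tMax;
        float t = dot(e2, q) * inv;
        return (t > 0.0f && t < tMax) ? t : tMax;
    }

    // The traversal loop itself: pop a node, test its box, push children or
    // test leaf triangles. This loop is the whole fixed-function job;
    // shading (and any reordering) stays on the shader cores.
    float traverse(const Node* nodes, const Vec3* verts, Vec3 o, Vec3 d) {
        Vec3 invD = {1.0f / d.x, 1.0f / d.y, 1.0f / d.z};
        float tBest = 1e30f;
        int stack[64], sp = 0;
        stack[sp++] = 0;                                // root
        while (sp > 0) {
            const Node& n = nodes[stack[--sp]];
            if (!rayBoxHit(o, invD, n.bmin, n.bmax, tBest)) continue;
            if (n.left < 0) {                           // leaf: triangle tests
                for (int i = 0; i < n.triCount; ++i)
                    tBest = rayTriHit(o, d, &verts[3 * (n.firstTri + i)], tBest);
            } else {
                stack[sp++] = n.left;                   // inner: push both children
                stack[sp++] = n.left + 1;
            }
        }
        return tBest;
    }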

    Still, we don't know what MS / Sony want their RT solutions to look like. Especially the latter could aim for flexibility, because abstraction has not been their goal so far. And from MS there are already publicly proposed DXR extensions requiring more flexible hardware.
    I expect the differences will be too large for a fair comparison with first-gen RTX - flexibility will have a cost in something like a raw MRays/sec test even when it isn't used.
     