Next Generation Hardware Speculation with a Technical Spin [post GDC 2020] [XBSX, PS5]

Discussion in 'Console Technology' started by Proelite, Mar 16, 2020.

  1. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,363
    Likes Received:
    3,944
    Location:
    Well within 3d
    It may not be specific to the PS4, but some time ago there was discussion about getting improved performance for GPU particles by sizing the tiles to match the footprint of the ROP caches. The general workflow assumes ROP caches are continuously servicing misses to memory, but staying within their caches in workloads that permit it leverages their broader internal data paths while significantly reducing their DRAM bandwidth consumption.
    Doubled ROPs in that subset of the work could then scale performance without needing as much memory bandwidth.
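    A hedged back-of-envelope sketch in Python illustrates the saving; the 64 KiB cache size and 40-layer overdraw are assumed, illustrative figures, not numbers from the discussion:

```python
# Hypothetical back-of-envelope: DRAM traffic for blending many particle
# layers into a tile, streaming through memory versus staying resident in
# an assumed 64 KiB ROP color cache. All figures are illustrative.

BYTES_PER_PIXEL = 4                              # RGBA8 render target
OVERDRAW = 40                                    # blended particle layers per pixel
TILE_PIXELS = (64 * 1024) // BYTES_PER_PIXEL     # pixels that fit in the cache

# Streaming: every blend is a read-modify-write against DRAM.
streaming = TILE_PIXELS * OVERDRAW * BYTES_PER_PIXEL * 2

# Cache-resident: one fill from DRAM and one writeback; blends stay on-chip.
resident = TILE_PIXELS * BYTES_PER_PIXEL * 2

print(streaming // resident)   # DRAM traffic ratio -> 40
```

    Under these assumptions the cache-resident path touches DRAM 40x less for the same blending work, which is the scaling headroom the post describes.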

    AMD's had an IOMMU of some form going back at least as far as Trinity. There's an IOMMU in the current consoles, and Kaveri fell just short of a full HSA device. HUMA was the marketing point that PS4 fans latched onto, for example.
    It's been present for years in standard hardware, so the next gen should be expected to continue to have it.
     
    Pete, tinokun, PSman1700 and 2 others like this.
  2. BRiT

    BRiT Verified (╯°□°)╯
    Moderator Legend Alpha

    Joined:
    Feb 7, 2002
    Messages:
    15,573
    Likes Received:
    14,165
    Location:
    Cleveland
    It does, and that's why a 125x improvement is available as a baseline on all NextGen consoles.
     
    PSman1700 and AzBat like this.
  3. ultragpu

    Legend Veteran

    Joined:
    Apr 21, 2004
    Messages:
    6,219
    Likes Received:
    2,263
    Location:
    Australia
    So back to where we were before: a faster SSD obviously streams things faster, affording more detail or more assets on screen than one that's half as fast, assuming both are already using all the available RAM of course. Do you not agree that's the PS5's power advantage?
     
  4. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    10,477
    Likes Received:
    10,164
    Location:
    The North
    hmm...
    I would probably disagree with that. Things like fillrate and tessellation would improve with clockrate; this much I agree with. But GPU problems are massive, and GPUs are designed to attack them in parallel. I have found that adding blocks (run across cores) gets tons of work done faster than adding threads within a block. You can have multiple threads in a thread block, or you can run more blocks with fewer threads each, and the number of blocks you have running tends to beat thread count in performance by an order of magnitude. At least that's what CUDA has shown me for simpler problems. When you need threads to work together, then you have to issue more threads per block, since blocks don't really share resources with each other. So there are pros and cons to having more or fewer blocks.
    (I'm not an AMD guy, sorry; the whole 64-threads-per-wave unit of work is lost on me at the moment. I guess that means for good saturation you need all 64 threads filled, or something.)
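    As a rough illustration of that saturation point, here's a toy occupancy model in Python. The register-file size and wave cap are simplified GCN-like stand-ins, not exact hardware limits:

```python
# Toy occupancy model (illustrative, GCN-like numbers): resident wavefronts
# per SIMD are capped by the register file, so register-heavy shaders leave
# fewer waves in flight to hide memory latency.

def resident_waves(regs_per_thread, threads_per_wave=64,
                   vgpr_file=16384,   # 32-bit registers per SIMD (256 x 64 lanes)
                   max_waves=10):     # architectural cap on waves per SIMD
    regs_per_wave = regs_per_thread * threads_per_wave
    return min(max_waves, vgpr_file // regs_per_wave)

print(resident_waves(24))   # light shader: hits the wave cap -> 10
print(resident_waves(96))   # register-heavy shader: only 2 waves in flight
```

    With only two resident waves, a single long memory stall leaves the SIMD mostly idle, which is the occupancy problem the wave-size discussion keeps circling.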

    *edit I recognize that things have changed with RDNA. It might be best to have a technical discussion here on how best to optimize performance on that architecture.

    This is just the way GPUs are designed: to be embarrassingly parallel. I'm not a 3D engine person, so I don't know for sure. But I suspect that if you've got loads and loads of work over fewer cores, the only way to saturate each core is to stuff in more threads. The issue with threads comes down to the overhead penalty of a thread switch. So maybe instead you do it properly: you issue the correct number of threads per thread block, and you'll normally have more thread blocks than physical cores available to work on them. So if you've got this massive 4K@60 render going and only 36 CUs, you're going to have fewer blocks running in parallel. Textures are broken up into little 4x4 blocks of information, compressed specifically for random access, exactly so that individual cores can process 4x4 or 8x8 blocks of pixels.
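    For a sense of scale, a quick sketch of how a 4K frame shatters into small tiles; the 8x8 tile size is just an example:

```python
# How many 8x8 pixel tiles a 4K frame breaks into, versus the number of
# CUs available to chew through them. Tile size is an illustrative choice.

WIDTH, HEIGHT = 3840, 2160
TILE = 8

tiles = (WIDTH // TILE) * (HEIGHT // TILE)
print(tiles)         # 480 * 270 = 129600 tiles per frame
print(tiles // 36)   # ~3600 tiles per CU on a 36-CU GPU
print(tiles // 52)   # ~2492 tiles per CU on a 52-CU GPU
```

    Either way there are vastly more tiles than compute units, which is why the work is trivially parallel and why core count, not just clock, matters for chewing through it.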

    So my answer is, it's certainly NOT clear that a PS5 can beat the same GPU with more cores operating at a slower frequency. As we move further towards compute shaders, the requirements for ROPs will continue to drop. The nice thing with compute is that its bandwidth needs don't have to be excessively high; it can do all sorts of operations without needing a lot of bandwidth.
     
    #2284 iroboto, May 5, 2020
    Last edited: May 5, 2020
    Pete, Michellstar, BRiT and 1 other person like this.
  5. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    10,477
    Likes Received:
    10,164
    Location:
    The North
    It's not so simple unfortunately.
    Pulling textures is just _1_ aspect of what needs to be done. There are a great number of render targets that will be written to draw a frame.

    If you want nice 4K HDR textures, you're looking at something like BC6H, where the uncompressed FP16 source is 64 bits per pixel (BC6H compresses that down to 8 bits per pixel). Yes, it's sent compressed. But you've got to generate mips, you've got to sample, you have aniso for oblique angles; all these things cost memory bandwidth. Just because you removed the I/O bottleneck doesn't mean you can freely dump 16K textures into your game. They need to be processed before being sent out. You still have memory limitations, you have bandwidth limitations, and depending on what you're trying to accomplish you may have ALU limitations. All compression techniques come with their own drawbacks and advantages.
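    To put rough numbers on this, a quick sketch of the sizes involved, using BC6H's fixed 128-bit blocks of 4x4 texels (8 bits per texel) against uncompressed FP16 RGBA:

```python
# Rough footprint of a single 4K HDR texture: uncompressed FP16 RGBA
# (64 bits per texel) versus BC6H (128-bit block per 4x4 texels = 8 bpp).
# A full mip chain adds roughly one third on top of the base level.

texels = 4096 * 4096

uncompressed = texels * 8            # FP16 RGBA: 8 bytes per texel
bc6h = texels * 1                    # BC6H: 1 byte per texel
bc6h_with_mips = bc6h * 4 // 3       # base level plus mip chain

print(uncompressed // 2**20)     # 128 MiB
print(bc6h // 2**20)             # 16 MiB
print(bc6h_with_mips // 2**20)   # ~21 MiB
```

    Even compressed, a handful of such textures per material across a scene adds up fast, which is why memory capacity and bandwidth remain walls regardless of how fast the SSD feeds them in.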

    The main thing to note is that, while it's great that I/O is improved, as someone posted earlier, memory bandwidth is falling dramatically behind the rate at which compute is increasing. I/O isn't bandwidth; you're still going to hit a bandwidth wall.
     
    #2285 iroboto, May 5, 2020
    Last edited: May 5, 2020
    Pete, dobwal, BRiT and 1 other person like this.
  6. ultragpu

    Legend Veteran

    Joined:
    Apr 21, 2004
    Messages:
    6,219
    Likes Received:
    2,263
    Location:
    Australia
    Well, let's hope Cerny sorted out the bandwidth factor when he put this SSD in the console. We'll find out in due time whether the PS5 is bandwidth limited or the SSD is over-designed.
     
  7. PSman1700

    Veteran Newcomer

    Joined:
    Mar 22, 2019
    Messages:
    2,520
    Likes Received:
    782
    Higher settings, more stable fps, etc. Resolution is less of a focus with today's reconstruction tech. They can target higher res and higher settings; it's simply a more powerful GPU with more advanced features like VRS, but also a faster CPU and higher bandwidth (which can bottleneck a whole system). And no boost or variable clocks/performance either.
    In general, cross-platform games will be best served on the XSX, and XSX exclusives that leverage the 12+ TF of GPU power will do things perhaps not possible on 9-10 TF hardware. It's 2070 vs 2080 Ti, after all.
     
    milk and AzBat like this.
  8. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    10,477
    Likes Received:
    10,164
    Location:
    The North
    The PS5 is still going to walk away with faster loading times even if the faster speed doesn't translate to in-game differences. There are also ALU requirements, right?

    You can't stick a 40GB/s SSD onto an XBO and expect it to process 4K textures and higher settings just because its I/O is solved.
     
    #2288 iroboto, May 5, 2020
    Last edited: May 5, 2020
    BRiT and PSman1700 like this.
  9. fehu

    Veteran Regular

    Joined:
    Nov 15, 2006
    Messages:
    1,729
    Likes Received:
    713
    Location:
    Somewhere over the ocean
    It's like a single man stops meditating, slowly stands up, and calmly says: "I know how to PS5"
     
    Pete likes this.
  10. manux

    Veteran Regular

    Joined:
    Sep 7, 2002
    Messages:
    2,096
    Likes Received:
    957
    Location:
    Earth
    Do we know anything other than the clockspeed and flops of the next-gen console GPUs? For example, could both consoles have the same number of ROPs, in which case there could be use cases where the PS5 has an advantage? I.e., could Cerny's narrower-but-higher-clockspeed approach be a real thing in some limited cases?
     
  11. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    10,477
    Likes Received:
    10,164
    Location:
    The North
    In any situation where the PS5 is not bandwidth bound, then yes, it's going to have an advantage while using the ROPs. I think tessellation is dead as a feature; people are likely to use mesh shaders instead.
    Smaller workloads, etc.
     
  12. Jay

    Jay
    Veteran Regular

    Joined:
    Aug 3, 2013
    Messages:
    2,615
    Likes Received:
    1,683
    The funny thing is no one has said there aren't cases where clockspeed would have an advantage. Just not to the extent that some believe, at least until we see how RT etc. impacts both designs.

    What if MS said the following:
    We went narrow and fast because we knew how fast the front end needed to be, and to get better occupancy, including changes made to facilitate ExecuteIndirect.
    So they'd be basing it on 12TF from 52 CUs at 1.825GHz, compared to 12TF from 64 CUs at a lower clock.
    That would make the XSX the fast and narrow option. It's relative.

    It just depends who's talking, from what perspective, etc. What Cerny said was true for the PS5; it doesn't mean he wouldn't have elected for 12TF if their design had been different from the start.
    Cerny was talking about 10TF and the best way they felt to get to it, not that it's necessarily better in any way than 12TF.
     
    KirkSi, Michellstar, DSoup and 11 others like this.
  13. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    11,035
    Likes Received:
    5,576
    Both consoles most probably have the same number of ROPs (Navi 10 with 36/40 CUs has 64 ROPs, and Microsoft has traditionally been conservative on ROP count), meaning the PS5 would have >22% higher peak fillrate.
    Whether or not the PS5 can take advantage of that fillrate throughput given its bandwidth (it supposedly has to share its 448GB/s with a large CPU, whereas Navi 10 doesn't) is a different story.
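    Running the numbers under that same-ROP-count assumption, with the widely reported clocks (PS5 up to 2.23GHz variable, Series X 1.825GHz fixed):

```python
# Peak pixel fillrate assuming 64 ROPs on both consoles, at the widely
# reported clocks. The equal ROP count is the post's assumption, not a
# confirmed spec.

ROPS = 64
ps5 = ROPS * 2.23     # Gpixels/s at 2.23 GHz (variable)
xsx = ROPS * 1.825    # Gpixels/s at 1.825 GHz (fixed)

print(round(ps5, 1), round(xsx, 1))       # 142.7 116.8
print(round((ps5 / xsx - 1) * 100, 1))    # 22.2 (% higher peak fillrate)
```

    With equal ROP counts the fillrate gap is purely the clock ratio, which is where the ~22% figure comes from; whether bandwidth lets either console sustain that peak is the separate question raised above.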
     
    manux and iroboto like this.
  14. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    10,477
    Likes Received:
    10,164
    Location:
    The North
    When Cerny said that it was very challenging to maintain high occupancy, he may have been referring to GCN, as it's very difficult to keep saturated given the nature of how it issues work.
    But AMD specifically went and addressed this with RDNA.
    The whole introduction of the whitepaper talks about improving the efficiency of sending work out to more CUs.

    ***
    The new RDNA architecture is optimized for efficiency and programmability while offering backwards compatibility with the GCN architecture. It still uses the same seven basic instruction types: scalar compute, scalar memory, vector compute, vector memory, branches, export, and messages. However, the new architecture fundamentally reorganizes the data flow within the processor, boosting performance and improving efficiency.

    In all AMD graphics architectures, a kernel is a single stream of instructions that operate on a large number of data parallel work-items. The work-items are organized into architecturally visible work-groups that can communicate through an explicit local data share (LDS). The shader compiler further subdivides work-groups into microarchitectural wavefronts that are scheduled and executed in parallel on a given hardware implementation. For the GCN architecture, the shader compiler creates wavefronts that contain 64 work-items.

    When every work-item in a wavefront is executing the same instruction, this organization is highly efficient. Each GCN compute unit (CU) includes four SIMD units that consist of 16 ALUs; each SIMD executes a full wavefront instruction over four clock cycles. The main challenge then becomes maintaining enough active wavefronts to saturate the four SIMD units in a CU. The RDNA architecture is natively designed for a new narrower wavefront with 32 work-items, intuitively called wave32, that is optimized for efficient compute. Wave32 offers several critical advantages for compute and complements the existing graphics-focused wave64 mode.

    One of the defining features of modern compute workloads is complex control flow: loops, function calls, and other branches are essential for more sophisticated algorithms. However, when a branch forces portions of a wavefront to diverge and execute different instructions, the overall efficiency suffers since each instruction will execute a partial wavefront and disable the other portions.

    The new narrower wave32 mode improves efficiency for more complex compute workloads by reducing the cost of control flow and divergence. Second, a narrower wavefront completes faster and uses fewer resources for accessing data. Each wavefront requires control logic, registers, and cache while active. As one example, the new wave32 mode uses half the number of registers. Since the wavefront will complete quicker, the registers free up faster, enabling more active wavefronts. Ultimately, wave32 enables delivering throughput and hiding latency much more efficiently. Third, splitting a workload into smaller wave32 dataflows increases the total number of wavefronts. This subdivision of work items boosts parallelism and allows the GPU to use more cores to execute a given workload, improving both performance and efficiency.
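    The divergence point in the quote can be sketched with a toy model; this deliberately ignores real scheduling and the register effects the whitepaper also describes:

```python
# Toy model of branch divergence cost: when a branch splits a wavefront,
# both sides execute serially with the inactive side's lanes masked off.
# Narrower waves waste fewer idle lane-slots when divergence is localized.

def efficiency(wave_size, diverging_lanes):
    """Fraction of issued lane-slots doing useful work when
    `diverging_lanes` of the wave take the other side of a branch."""
    paths = 2 if 0 < diverging_lanes < wave_size else 1
    return wave_size / (paths * wave_size)

# 8 divergent lanes inside one wave64: the whole wave pays for both paths.
print(efficiency(64, 8))                            # 0.5
# The same 64 lanes as two wave32s, divergence confined to one of them:
print((efficiency(32, 8) + efficiency(32, 0)) / 2)  # 0.75
```

    The gain only appears when divergent lanes cluster into some waves and leave others uniform; if every wave32 still contains both branch sides, the narrower mode saves nothing on this axis.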
     
    Pete, DSoup, PSman1700 and 6 others like this.
  15. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    11,035
    Likes Received:
    5,576

    ALU occupancy is reportedly low on GCN in games (I read figures somewhere that shocked me, typically 40 to 60%, which is why Vega 56 and Vega 64 are indistinguishable at ISO clocks), but do we have numbers for RDNA occupancy?
    The Pascal GTX 1070 vs. Maxwell Titan X comparison tells us that narrower and faster-clocked gets somewhat better results on a similar architecture, despite the latter's large bandwidth advantage (256GB/s vs. 336GB/s).
    Cerny is right even regarding available bandwidth, though I doubt the narrower + higher clocked will fully make up for the 18% difference in compute and 25% in memory bandwidth.




    BTW, does the Series X offer a low-level API or does everything still need to go through a virtual machine?
     
  16. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    10,477
    Likes Received:
    10,164
    Location:
    The North
    I can only suspect that occupancy is better, just looking at some examples of how shader code runs versus GCN. But I think it will be some time before that information is released; perhaps sometime next year, once RDNA goes mainstream with the release of the consoles, we will see something at GDC, I suspect.

    The Series X will still use a console-based DX12 variant to offer slightly more access and features specific to the console hardware, but everything still runs through a virtual machine.
     
    PSman1700 likes this.
  17. Inuhanyou

    Veteran Regular

    Joined:
    Dec 23, 2012
    Messages:
    1,101
    Likes Received:
    272
    Location:
    New Jersey, USA
    So I just heard about the Tempest engine being considered "hardware accelerated" as a sound system, which may help free up CPU cycles for other, more important tasks.

    Any technically minded folks in here want to chime in for us dumb-dumbs? Benefits or weaknesses to this? I heard that it can sap bandwidth pretty badly?
     
  18. Jay

    Jay
    Veteran Regular

    Joined:
    Aug 3, 2013
    Messages:
    2,615
    Likes Received:
    1,683
    I believe on the XO, DX12 was considered low level and DX11.x was considered high level. I also think they pushed that idea on PC.

    I wonder if that's still the case on the XSX, or if they're now dropping DX11 totally. I don't believe they've added RT support to DX11 on PC.
     
  19. TheAlSpark

    TheAlSpark Moderator
    Moderator Legend

    Joined:
    Feb 29, 2004
    Messages:
    21,578
    Likes Received:
    7,130
    Location:
    ಠ_ಠ
    The closest comparison I can think of is maybe putting an RX 590 (36CU Polaris) against an RX 5700 (36CU Navi), although the latter has double the ROPs, and the clocks would have to be normalized for throughput (core and memory).

    You'd need a really in-depth analysis for something more specific, though...
     
  20. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    10,477
    Likes Received:
    10,164
    Location:
    The North
    Technical thread on that is here:

    https://forum.beyond3d.com/threads/...ustics-windows-sonic-dolby-atmos-dts-x.61680/
     
    Pete, tinokun, PSman1700 and 2 others like this.