Next Generation Hardware Speculation with a Technical Spin [post GDC 2020] [XBSX, PS5]

Discussion in 'Console Technology' started by Proelite, Mar 16, 2020.

  1. Dictator

    Newcomer

    Joined:
    Feb 11, 2011
    Messages:
    247
    Likes Received:
    939
    100% agree with this - I think decals, layers, and tiling detail textures, plus shader diversity work (a cloth shader next to an illum shader next to a skin shader next to a hair shader next to a fibreglass shader), are going to give that minute detail and differentiation - not huge per-object textures!
     
    BRiT likes this.
  2. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    3,287
    Likes Received:
    3,546
    I think by now we know this kind of memory arrangement has its own major drawbacks, ones that simply outweigh the drawbacks of a split memory pool. You pay for the flexibility of the shared memory pool by having to significantly increase bandwidth; otherwise, CPU/GPU contention will wipe out any advantage the shared arrangement gives you.

    By the same logic, the PS5 also introduces CPU/GPU clock contention; even looking at this superficially, you can tell this kind of system will never be optimal.
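    As a back-of-envelope sketch of that contention cost (a minimal model; the CPU bus share and switching overhead below are assumed figures, not measurements):

    [code]
    # Toy model: if the CPU occupies the shared bus for some fraction of
    # each frame, the GPU's effective bandwidth drops by at least that
    # fraction, and by more if arbitration/switching isn't free.
    PEAK_BW_GBPS = 448.0     # e.g. a 256-bit GDDR6 bus @ 14 Gbps
    cpu_bus_share = 0.10     # assumption: CPU holds the bus 10% of the time
    switch_overhead = 0.05   # assumption: extra loss from switching penalties

    gpu_bw = PEAK_BW_GBPS * (1.0 - cpu_bus_share - switch_overhead)
    print(f"GPU effective bandwidth: {gpu_bw:.0f} GB/s of {PEAK_BW_GBPS:.0f}")
    # -> GPU effective bandwidth: 381 GB/s of 448
    [/code]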
     
    VitaminB6 and PSman1700 like this.
  3. Betanumerical

    Veteran

    Joined:
    Aug 20, 2007
    Messages:
    1,692
    Likes Received:
    168
    Location:
    In the land of the drop bears
    Do we know if this is how it is actually laid out, or is it just set up as a slow and a fast memory space, with striping across different numbers of memory modules? I don't think I saw anything that indicated there was bandwidth or channels dedicated to the GPU and CPU in the XSX.

    Additionally, wouldn't any memory that's accessed during this 'contended' period with only 4 chips only be accessible at 224GB/s?

    I'm not entirely sure what is being suggested here is how things work in the real world. To me it seems like the simpler solution of just giving exclusive access of the bus to a single 'unit' on the APU for the period of the memory transaction would be how it works in practice.
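    For reference, the 224GB/s figure falls straight out of the chip count, assuming the announced 14 Gbps/pin and one 32-bit interface per chip:

    [code]
    # XSX bandwidth scales linearly with how many of the ten GDDR6 chips
    # a transfer can stripe across (14 Gbps/pin, 32 bits per chip).
    GBPS_PER_PIN = 14
    BITS_PER_CHIP = 32

    def bw(chips):
        return chips * BITS_PER_CHIP * GBPS_PER_PIN / 8  # GB/s

    print(bw(10))  # 560.0 - full 320-bit bus, the "GPU optimal" 10 GB
    print(bw(6))   # 336.0 - the slower 6 GB region
    print(bw(4))   # 224.0 - only the four 1 GB chips, as asked above
    [/code]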
     
  4. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    43,577
    Likes Received:
    16,028
    Location:
    Under my bridge
    It's not about flexibility, but capacity vs BW. If you have 8 GBs VRAM and 16 GBs RAM, your drawing either limits itself to 8 GBs VRAM to use its full BW with only 8 GBs of assets (less framebuffers), or stores assets in RAM and can use more than 8 GBs of assets but has to copy them across the slow bus into VRAM to draw. If you have 16 GBs of unified RAM, you have all 16 GBs available for assets but you impact maximum BW.

    More assets, or faster reads and writes? Pick your poison - you can't have both. The PC's choice has nothing to do with it being the better compromise; it's the necessary design to support the open-ended architecture, and it makes up for the crap RAM bandwidth by using great gobs of expensive VRAM as a massive cache for that redundantly duplicated data sitting in both pools.
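    As a quick sketch of that poison pick, with illustrative PC-like numbers (both bandwidth figures are assumptions for the example, not any specific card):

    [code]
    # Touching an asset that lives in system RAM costs a trip over the
    # slow expansion bus first, which is why PCs duplicate data into
    # VRAM as a big cache.
    VRAM_BW = 448.0   # GB/s, assumed GDDR6-class card
    BUS_BW = 16.0     # GB/s, assumed PCIe 3.0 x16-class link

    asset_gb = 2.0
    print(f"Read from VRAM:      {asset_gb / VRAM_BW * 1000:.1f} ms")   # ~4.5 ms
    print(f"Copy over bus first: {asset_gb / BUS_BW * 1000:.1f} ms")    # ~125 ms
    # A unified pool skips the copy entirely, at the price of a lower
    # (shared) peak bandwidth for everything.
    [/code]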
     
    #1624 Shifty Geezer, Apr 1, 2020
    Last edited: Apr 1, 2020
  5. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    43,577
    Likes Received:
    16,028
    Location:
    Under my bridge
    But that doesn't explain (to me anyway!) why the BW drop was far higher than the CPU was using, and why that can't be fixed with a better memory controller. I would have expected (as did everyone else, because the BW drop came as a surprise) that while the CPU was accessing the RAM, the GPU had to wait, but it'd be 1:1 CPU usage to BW impact. What we saw on Liverpool was the RAM losing efficiency somehow, as if there was a switching penalty. I would hope AMD can fix that issue and have a near 1:1 impact on their console UMAs, so 1 ms of full RAM access for the CPU means only 1 ms less available for the GPU and the remaining frame time accessible at full rate.
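    As a toy model of the difference between an ideal 1:1 arbiter and what Liverpool seemed to show (the peak is PS4's announced figure; the CPU residency and penalty are illustrative assumptions):

    [code]
    # Ideal 1:1 arbitration: 1 ms of CPU bus residency per frame costs
    # the GPU exactly that 1 ms of bus time. The Liverpool behaviour
    # looked like an extra efficiency penalty on top.
    FRAME_MS = 16.6
    PEAK = 176.0   # GB/s, PS4's GDDR5 bus

    def gpu_bw(cpu_ms, switch_loss=0.0):
        usable = (FRAME_MS - cpu_ms) / FRAME_MS
        return PEAK * usable * (1.0 - switch_loss)

    print(f"ideal 1:1:               {gpu_bw(1.0):.0f} GB/s")        # ~165
    print(f"with 20% switch penalty: {gpu_bw(1.0, 0.20):.0f} GB/s")  # ~132
    [/code]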
     
  6. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    10,828
    Likes Received:
    10,867
    Location:
    The North
    Yea, some parts weren't clear. Well, @3dilettante's explanation actually helps provide a lot of context for what might be happening with respect to optimizing memory layouts: either for the GPU for maximum bandwidth, or for the CPU for less paging. The more the CPU has to jump around, the longer it takes, perhaps eating into bandwidth further.

    It may be different this time around, 7 years later.
     
  7. psorcerer

    Regular

    Joined:
    Aug 9, 2004
    Messages:
    732
    Likes Received:
    134
    I'm not sure we need to explain it for the 100th time. It was perfectly fine the first 10 times, but now it's getting ridiculous.
    They had problems maintaining 2+3 for the maximum possible load, because that's something nobody does in the hardware world.
    XBSX claims to do it, but that's the claim that should be scrutinized, because it's the unrealistic one, not Sony's.
    Sony's claim is perfectly fine: when power-hungry operations are used too much, the GPU or CPU underclocks. That has been the case for every GPU and CPU till now.
    The novel thing is that they measure the load/power by profiling the instructions in real time.
    And yes, any CPU vendor that states a frequency is doing it on an "average load" basis, not on a "100% max load ever possible" basis, which would be a much, much lower frequency (like the 1.6GHz example for Intel on a 3.5GHz CPU).
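    A minimal sketch of that real-time profiling idea, with a made-up power model and budget just to show the shape of it:

    [code]
    # Clocks track profiled activity against a fixed power budget: hold
    # the max clock for typical loads, shave a few percent only when a
    # pathological instruction mix would exceed the budget.
    POWER_BUDGET_W = 200.0   # assumption, not an announced figure

    def gpu_clock_ghz(activity):             # activity: 0.0 idle .. 1.0 worst case
        est_power = 150.0 + 60.0 * activity  # assumed power model
        if est_power <= POWER_BUDGET_W:
            return 2.23                      # hold the max clock
        return 2.23 * (POWER_BUDGET_W / est_power)  # scale to fit the budget

    print(gpu_clock_ghz(0.5))  # 2.23 - typical load, no downclock
    print(gpu_clock_ghz(1.0))  # ~2.12 - a few percent shaved off
    [/code]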

    That's online lighting calculation. Using textures is a way of trading memory<->ALU. In the end, if your pipeline is some sort of "full" lighting solution (RT/photon mapping/radiosity/etc.) you don't need textures at all.
    You have your materials, and all their properties are calculated in real time.
    Unfortunately the current state of ALU power prohibits such things from running in real time (even Minecraft-like graphics uses textures with RT).
    Therefore pre-baking things is still the most viable solution. Essentially you use ALU for the most dynamic stuff in your frame, or the stuff that's most visible as "dynamic" in the frame, and statically bake all other things.
    On the other hand, there are a lot of other dynamic things that can be pre-baked to great effect.
    For example, the weather. It changes pretty slowly (over ~1000 frames), so you can have essentially a full-blown weather transition with the SSD, where each weather change brings in a whole new set of environment animations/details/decals/textures.
    You can have permanent growth/damage/wear in a lot of places, which was trouble to load in time from an HDD.
    Etc.
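    A rough streaming budget for that ~1000-frame weather transition (the set size is an assumption; the SSD rates are the announced raw figures):

    [code]
    # Swapping in a whole new environment set over ~1000 frames leaves
    # a generous window, so the required rate is modest.
    frames, fps = 1000, 60
    new_set_gb = 8.0   # assumed size of the new environment set

    window_s = frames / fps   # ~16.7 s to hide the transition
    print(f"required: {new_set_gb / window_s:.2f} GB/s")   # ~0.48 GB/s
    # vs PS5's 5.5 GB/s raw and XSX's 2.4 GB/s raw - plenty of headroom
    [/code]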

    Not to mention that these copies are heavily CPU-involved, because the device buffer format is not the same as the host buffer format.
    Which is another consequence of an open-ended architecture.
     
    megre likes this.
  8. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    10,828
    Likes Received:
    10,867
    Location:
    The North
    I no longer think it’s obvious how MS will address this after getting better insight. Yes it would be ideal to pick up the remaining amount instead of casting it aside.
     
  9. Betanumerical

    Veteran

    Joined:
    Aug 20, 2007
    Messages:
    1,692
    Likes Received:
    168
    Location:
    In the land of the drop bears
    I agree it would be ideal to pick up the remaining bandwidth. But if that were the case, wouldn't it imply the existence of three different memory spaces and not two? Those three being:

    1 - Fast space (10/10 memory modules)
    2 - Slower space (6/10 memory modules)
    3 - Parallel access space (4/10 memory modules)

    You'd have to specifically allocate for all three spaces.
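    A capacity sketch of that hypothetical three-space carve-up, given the XSX's mixed chip population (six 2GB chips plus four 1GB chips, 16GB total):

    [code]
    # Hypothetical layout per the post above - not a confirmed design.
    two_gb_chips, one_gb_chips = 6, 4

    fast = (two_gb_chips + one_gb_chips) * 1  # first GB of all 10 chips = 10 GB
    slow = two_gb_chips * 1                   # second GB of the 2 GB chips = 6 GB
    # A third "parallel access" space on the four 1 GB chips would have
    # to steal its capacity out of the fast space above:
    parallel = one_gb_chips * 1               # up to 4 GB, at 224 GB/s

    print(fast, slow, parallel)  # 10 6 4 - three ranges to allocate into
    [/code]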
     
  10. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    10,828
    Likes Received:
    10,867
    Location:
    The North
    I still think it makes sense on console to do it this way. It just saves costs, and you're right that there is bandwidth loss during contention, but when there isn't any, the gains are there. And when contention is present you lose bandwidth, but not so much that it's choking the system.

    Seems a fair trade-off. It's lower, but not too low, and it can reach some good highs.
     
    VitaminB6 likes this.
  11. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    10,828
    Likes Received:
    10,867
    Location:
    The North
    I don't know how many controllers there are, so I guess I'm not sure what is possible. I have pretty good confidence that MS knew the exact performance of this machine before they burnt the silicon, using live code. They should know, as they did with X1X. If this were really causing an issue, we would know by now. But more importantly, if it were really causing an issue and developers didn't like it, converting the 4 single 1GB chips to 2GB chips is still an option. I just don't see this happening.

    MS has been rolling forward with their plans and hitting a cadence here. Observing their entire strategy shows me that everything is going according to their expectations and plans. A concession here would be something they missed, and it would be startling if, after years of 'customization', they suddenly discovered the challenges of asymmetrical memory capacity right before launch - as if MS spent 3 years doing nothing, when they built most of their features and resolved most of their issues during the X1X generation.
     
    blakjedi, PSman1700 and BRiT like this.
  12. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    9,134
    Likes Received:
    3,030
    Location:
    Finland
    There are either 5 controllers with 4×16-bit channels each, 10 32-bit controllers with 2×16-bit channels each, or 20 16-bit controllers. Each GDDR6 chip has two independent 16-bit channels, which the memory controllers need to adhere to.
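    All three arrangements add up to the same 320-bit bus:

    [code]
    # 10 GDDR6 chips x 2 independent 16-bit channels = 320 bits total,
    # however the controllers are grouped.
    configs = {
        "5 controllers x 4 x 16-bit":   5 * 4 * 16,
        "10 controllers x 2 x 16-bit": 10 * 2 * 16,
        "20 controllers x 1 x 16-bit": 20 * 1 * 16,
    }
    for name, bits in configs.items():
        print(name, "=", bits, "bits")  # all print 320
    [/code]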
     
    blakjedi, Shoujoboy, function and 2 others like this.
  13. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    7,837
    Likes Received:
    1,155
    Location:
    Guess...
    The original Xbox was pretty similar.

    RSX was quite comparable to the high-end PC GPUs of the time, being largely equivalent to a GeForce 7800 GTX 256 - a more or less top-end GPU of its day. I'd say it was at least as close to top-end PC GPUs as what the PS4 or PS5 are launching with. It only seemed anemic at the time because it was preceded by Xenos, which was well ahead of its time, and came just before the much more powerful 8800 GTX arrived in the PC market.
     
  14. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    9,134
    Likes Received:
    3,030
    Location:
    Finland
    While it was equivalent to the 7800 GTX in many ways, it had only half the ROPs and half the memory bandwidth (not counting XDR here, since RSX had to access it via the CPU).
     
    PSman1700 likes this.
  15. chris1515

    Veteran Regular

    Joined:
    Jul 24, 2005
    Messages:
    4,786
    Likes Received:
    3,744
    Location:
    Barcelona Spain
    AMD has a patent for improving and mitigating this problem. They simply prioritize CPU memory calls, because the CPU is more sensitive to latency than the GPU. I don't remember where to find the patent, but someone found it on Era. @anexanhume maybe?
     
  16. anexanhume

    Veteran Regular

    Joined:
    Dec 5, 2011
    Messages:
    1,938
    Likes Received:
    1,276
    Yes, there was a patent. It’s going to take a while for me to find it.

    Edit: there was this, but I think there's another that more directly addresses memory use in an HSA system.
    http://www.freepatentsonline.com/y2019/0122417.html
     
    #1636 anexanhume, Apr 1, 2020
    Last edited: Apr 1, 2020
    BRiT likes this.
  17. function

    function None functional
    Legend Veteran

    Joined:
    Mar 27, 2003
    Messages:
    5,345
    Likes Received:
    2,813
    Location:
    Wrong thread
    Maybe ... if you're performing (for example) a single GPU access with data spread across two channels (say a 128-bit vector), and the CPU jumps to the front of the queue with a pesky 32-bit read, and the memory controller can't easily schedule something to do on the other channel with no notice ... the total lost bandwidth to the GPU would be greater than the actual data transferred to the CPU.

    If the GPU likes wide access and the CPU is often doing narrow, CPU access being prioritised could really multiply the losses to the GPU.

    Access patterns and priorities and all that, innit.
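    A toy illustration of that amplification (the per-channel rate and stall window are invented for the example):

    [code]
    # A prioritised 32-bit CPU read that stalls a pair of 16-bit
    # channels for a full burst window costs the GPU far more bytes
    # than the CPU actually moved.
    CHANNEL_BYTES_PER_CYCLE = 4   # assumed, per 16-bit GDDR6 channel
    STALL_CYCLES = 16             # assumed burst/turnaround window

    cpu_bytes_moved = 4                                           # one 32-bit read
    gpu_bytes_lost = 2 * CHANNEL_BYTES_PER_CYCLE * STALL_CYCLES   # both channels idle
    print(f"CPU moved {cpu_bytes_moved} B, GPU lost {gpu_bytes_lost} B")
    print(f"amplification: {gpu_bytes_lost // cpu_bytes_moved}x")  # 32x in this toy case
    [/code]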
     
    BRiT likes this.
  18. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    7,837
    Likes Received:
    1,155
    Location:
    Guess...
    Very true. Although it still likely fared at least as favorably against the 7800 GTX 256 as the PS5 GPU will against the high-end PC GPUs at the time of its launch (i.e. big Navi and GA102).
     
    Kaotik likes this.
  19. Rockster

    Regular

    Joined:
    Nov 5, 2003
    Messages:
    973
    Likes Received:
    129
    Location:
    On my rock
    Huh? There is just one memory space, with a 320-bit interface to the memory controller. The only thing that varies is chip density, so some modules have a greater capacity, or addressable range, than others. The fact that you might request data from a range within one module that extends beyond the available range of another does nothing to prevent all modules from being accessed every cycle. As for how coherent CPU requests impact the GPU, that remains the same regardless.
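    One plausible (not confirmed) interleave consistent with that single-address-space view:

    [code]
    # The first 10 GB stripes across all ten chips; addresses above
    # that stripe across only the six 2 GB chips. Granularity assumed.
    STRIPE = 256   # bytes per stripe, assumption
    GB = 1 << 30

    def chip_for(addr):
        if addr < 10 * GB:                 # "GPU optimal" region
            return (addr // STRIPE) % 10   # one of all ten chips
        rel = addr - 10 * GB               # upper 6 GB region
        return (rel // STRIPE) % 6         # one of the six 2 GB chips

    print(chip_for(0))               # 0
    print(chip_for(10 * GB + 512))   # 2 - lands in the 6-chip stripe
    [/code]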
     
    VitaminB6 and PSman1700 like this.
  20. PSman1700

    Veteran Newcomer

    Joined:
    Mar 22, 2019
    Messages:
    2,703
    Likes Received:
    903
    RSX was basically a generation behind; the 8800 series launched in autumn 2006 (along with the Intel quads). It didn't help that the 8800 was one of the biggest jumps in history.
    One would hope the PS5 GPU fares better than RSX did. But since we're at about 14TF now for GPUs released in 2018 (not counting Titan), the highest-end Navi/Ampere could be close to 20TF, perhaps with HBM on AMD's side. That could be close to double, probably without downclocks.
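    The TF arithmetic behind that comparison (PS5's figures are announced; the 80 CU part is a hypothetical big Navi, not a confirmed spec):

    [code]
    # TFLOPS = 2 ops (FMA) x shader lanes x clock
    def tflops(cus, ghz, lanes_per_cu=64):
        return 2 * cus * lanes_per_cu * ghz / 1000

    print(f"PS5, 36 CU @ 2.23 GHz:        {tflops(36, 2.23):.2f} TF")  # 10.28
    print(f"hypothetical 80 CU @ 1.9 GHz: {tflops(80, 1.90):.2f} TF")  # 19.46
    [/code]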
     