Next Generation Hardware Speculation with a Technical Spin [post GDC 2020] [XBSX, PS5]

Discussion in 'Console Technology' started by Proelite, Mar 16, 2020.

  1. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    10,869
    Likes Received:
    10,960
    Location:
    The North
hmm. I would wait for a deep dive before you jump on this line of thought. You don't know how their memory controllers work, how many there are, or how their interleaving works out.
You are looking at an absolute worst case and making this comparison assuming the memory controller won't fill the remaining lanes (the other 224 GB/s worth) with other data.

But let's do this anyway for the sake of clarity. I will work out your scenario as to what should be happening at a simplistic, amateur but granular level.

    Anyone feel free to correct me here; lots of senior members around lately.
Let's assume a best-case scenario for PS5: the data needed is 512 bits.

    XSX
1st clock cycle: 320 bits across all 10 chips (the 10 GB region)
2nd clock cycle: 192 bits off the six 2 GB chips (the 6 GB region); the other four chips' lanes are wasted
    Total = 512 bits pulled in 2 clock cycles.

PS5:
1st clock cycle: 256 bits
2nd clock cycle: 256 bits
Total = 512 bits pulled in 2 clock cycles.

This is, of course, assuming the memory controller is set up such that it will _not_ fill the extra lanes, and a situation where you have _2_ devices contending for memory. In that worst-case scenario you speak of, both are exactly equal.

But let's look at another case then:
Say the data was sized and spread in such a way that it was exactly 40 bytes, or exactly 320 bits: the best-case scenario for XSX.

    XSX will grab all this data in 1 clock cycle
    PS5 will need 2 cycles to do this and on the second cycle it wastes the remaining chip lanes for the request.

Let's look at a real example then.
4 KB, or 32,768 bits. This is a standard hard drive block.
Striped across the chips, this keeps all 10 of them busy in full 32-bit beats until the final transfer, and the same goes for anything that's a multiple of 4 KB.

32,768 bits is respectively:
XSX: ~103 clock cycles (32,768 / 320 = 102.4, so the last transfer only partially fills the bus)
PS5: 128 clock cycles exactly

They are the same memory speeds, so there are no additional differences here. XSX can start processing another request about 25 clock cycles before PS5 completes.

What about 1024 KB, i.e. 8,388,608 bits?
XSX: ~26,215 clock cycles
PS5: 32,768 clock cycles
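
Here's that arithmetic as a quick sketch (a minimal Python calculation of the toy model above; real GDDR6 moves data in 16-beat bursts per channel, so treat these as illustrative counts rather than hardware behaviour):

```python
import math

def cycles(n_bytes: int, bus_width_bits: int) -> int:
    """Clock cycles to move n_bytes in the toy model: every active chip
    hands over 32 bits per cycle, and a partial last cycle still costs a cycle."""
    return math.ceil(n_bytes * 8 / bus_width_bits)

for label, n in [("512 bits", 64), ("4 KB", 4 * 1024), ("1024 KB", 1024 * 1024)]:
    print(f"{label:>8}: XSX (320-bit) = {cycles(n, 320):>6}, PS5 (256-bit) = {cycles(n, 256):>6}")
```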

Alright, so from this we see that when the data lives in the 10 GB GPU-optimal region, XSX can go faster than PS5.
So let's look at the worst-case scenario: all 16 GB is full. Let's have them race to offload all 16 GB.
Well, earlier I showed you that XSX takes 320 bits off all ten chips in one clock cycle, then 192 bits off the six 2 GB chips in the next.

Over those 2 clock cycles it matches PS5's 512 bits.

Keep that pattern up and the slow 6 GB and the fast 10 GB run out at the same time (6 GB at 192 bits every other cycle takes exactly as long as 10 GB at 320 bits every other cycle), so both consoles clear their 16 GB in the same number of clock cycles. Each 2 GB chip can only hand over 32 bits per cycle, and those chips are the bottleneck on both machines. XSX is never slower here; it only pulls ahead when the working set sits mostly in the 10 GB region.

And there you have your scenario played out. And this is why we don't average speeds.
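
A tiny sanity check on that race (same toy model, in Python; it just asks how long the most heavily loaded chip needs to hand over everything it holds, using the publicly known chip layouts):

```python
BITS_PER_CHIP_PER_CYCLE = 32
BITS_PER_GB = 2**30 * 8  # treating the 1 GB / 2 GB chips as GiB-sized parts

def cycles_to_dump(chip_capacities_gb) -> int:
    """Every chip moves 32 bits per cycle in parallel, so the fullest chip
    sets the finish time for dumping the whole memory pool."""
    return max(c * BITS_PER_GB // BITS_PER_CHIP_PER_CYCLE for c in chip_capacities_gb)

xsx = [2] * 6 + [1] * 4   # six 2 GB chips plus four 1 GB chips (16 GB)
ps5 = [2] * 8             # eight 2 GB chips (16 GB)
print(cycles_to_dump(xsx), cycles_to_dump(ps5))  # identical totals
```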
     
    #1581 iroboto, Mar 31, 2020
    Last edited: Mar 31, 2020
    RagnarokFF, blakjedi, AzBat and 8 others like this.
  2. McHuj

    Veteran Regular Subscriber

    Joined:
    Jul 1, 2005
    Messages:
    1,563
    Likes Received:
    758
    Location:
    Texas
I would be really curious to learn how the XSX memory controller works and manages memory transactions on this 320-bit bus. In everything I've ever seen (at least the hardware I've worked with), bus widths were always powers of 2, and cache-line fetches broke down nicely into them to avoid partial transactions.

Are GPU transactions just big? Does a GPU fetch multiple lines in one transaction? Or can the controller coalesce transactions in some efficient manner to minimize overhead and dead cycles?
     
  3. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,591
    Likes Received:
    994
It's not one big 320-bit bus; it's either 5 independent 64-bit buses or 10 32-bit ones.

    Cheers
     
  4. Metal_Spirit

    Regular Newcomer

    Joined:
    Jan 3, 2007
    Messages:
    558
    Likes Received:
    341
Let me review your cases, because it seems to me you are using best-case scenarios for the Xbox in all of your examples.

    "Lets assume a best case scenario for PS5 that data needed is 512 bits.

    XSX
    1st clock cycle: 320 bits off the first 10 chips
    2nd clock cycle: 192 bits off the remaining 6 GB chips and the other lanes are wasted
    Total = 512 bits pulled in 2 clock cycles.

    PS5:
    2 clock cycles for PS5:
    It will grab 256
    Then another 256 bits
    Total = 512 Bits"

Aren't you assuming a perfect data disposition across the chips?
What if the data needed for the second cycle was also in the 10 GB pool?
Xbox would be wasting the remaining lanes.

Now let's assume all the data is in the 6 GB slow region: that would require 3 cycles, with some waste as well!

    "If data was sized and spread in such a way that it was exactly 40 bytes or exactly 320 bits, or the best case scenario for XSX.

    XSX will grab all this data in 1 clock cycle
    PS5 will need 2 cycles to do this and on the second cycle it wastes the remaining chip lanes for the request."

    What if that data is fragmented on both memories? You would require 2 cycles
    Or if it is all on the slower memory? 2 cycles plus waste.

    "Lets look at a real example then.
    4KB or 40,960 bits. This is a standard hard drive block.
    This divides perfectly into the 320 bit bus and it will access the memory all 10 chips every time in full 32 bit blocks. This is the case for anything in multiple of 4KB.

    40960 bits is respectively:
    XSX: 128 clock cycles
    PS5: 160 clock cycles"

    Another best case scenario for Xbox... Is it not?
    You are assuming you can read all from the 10 GB But we are talking 4K. What if those 4K are in slower memory? 214 cycles for Xbox!

    I think you got my point!
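
To put numbers on the placement argument (same toy model in Python; the even split is just an assumed example, and it pessimistically assumes the controller can't overlap accesses to the two regions):

```python
import math

def xsx_cycles(bytes_in_fast: int, bytes_in_slow: int) -> int:
    """Toy cycle count for one request on XSX: the fast region spans all
    10 chips (320-bit), the slow region only the six 2 GB chips (192-bit)."""
    return math.ceil(bytes_in_fast * 8 / 320) + math.ceil(bytes_in_slow * 8 / 192)

print(xsx_cycles(4096, 0))        # 103 - all 4 KB in the fast 10 GB region
print(xsx_cycles(2048, 2048))     # 138 - split evenly across both regions
print(xsx_cycles(0, 4096))        # 171 - all 4 KB in the slow 6 GB region
print(math.ceil(4096 * 8 / 256))  # 128 - PS5, for comparison
```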

Regardless, I never brought the PS5 into the equation, and I never compared the performance of the two! I was just saying that those 560 GB/s on the Xbox Series X, like TFLOPS, don't tell the whole story, and that penalties to performance can occur. If I ever mentioned the PS5, it was only because Sony's console does not present these problems. Nothing else.

But yes... I'm comparing this with the GeForce case... It can happen!
     
    egoless and KeanuReeves like this.
  5. McHuj

    Veteran Regular Subscriber

    Joined:
    Jul 1, 2005
    Messages:
    1,563
    Likes Received:
    758
    Location:
    Texas
Interesting. I didn't think of it that way. I always assumed it was one big bus rather than a collection of smaller buses.
     
  6. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    10,869
    Likes Received:
    10,960
    Location:
    The North
It's 10×32-bit.
Wait, nvm, it may not be.
     
    #1586 iroboto, Mar 31, 2020
    Last edited: Mar 31, 2020
  7. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,367
    Likes Received:
    3,961
    Location:
    Well within 3d
    Usually, these accesses are in the form of a cache line or two, depending on the specifics of the line length. GCN had a mixture of 32B and 64B lines, x86 is usually 64B, and RDNA is a mix of 64B and 128B. DRAM pages are on the order of several KB, with GDDR6 being 2KB from what I've searched up.
    The chips have 2 independent channels each, and the optimal case is for a transaction to be satisfied by one channel/controller.

    While there are instances of IO or some CPU work that might not align well, the common case is that DRAM can provide 32B per burst, and likes it if you can access the next KB or so before moving along. The caches expect some small multiple of 32B or 64B for their transactions, and things like the GPU's overall rasterization pipeline exist to match up well with DRAM and cache alignment. Hence why techniques that are less coherent tend to do poorly, and why it's not that easy to replace the hardware that matches the memory so well.

    This generally isn't going to happen.
A GDDR6 DRAM data payload comes over a channel in a burst of 16 transfers. That is usually half of a cache line (meaning the next DRAM transaction is very predictable), and almost all accesses are going to be going through the cached memory pipeline.
An attempt to pull just the first 320 bits across the whole system still ties up each channel for the full 16-beat burst, so the subsequent bus cycles are either reading data that's wanted anyway or wasting 15/16 of the bandwidth.

    Each GDDR6 chip has two independent 16-bit channels (subject to undisclosed details of AMD's hardware), and each channel gets half the chip's physical capacity. The DRAM arrays encourage linear accesses and hitting open pages as much as possible, because there are very large overheads related to changing banks or changing the activity over the bus.
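
To put rough numbers on that (a minimal Python sketch; the 14 Gbps pin rate and chip counts are the publicly stated figures, the rest follows from GDDR6's two-channel, 16-beat-burst organization):

```python
# GDDR6 basics: each chip exposes two independent 16-bit channels,
# and a read or write moves a burst of 16 beats per channel.
CHANNEL_WIDTH_BITS = 16
BURST_LENGTH = 16
PIN_RATE_GBPS = 14  # Gb/s per data pin on both consoles

bytes_per_burst = CHANNEL_WIDTH_BITS * BURST_LENGTH // 8
print(f"payload per channel burst: {bytes_per_burst} bytes")  # 32 B, half a 64 B line

def peak_bandwidth_gbs(num_chips: int) -> float:
    """Aggregate peak bandwidth: data pins times per-pin rate, in GB/s."""
    data_pins = num_chips * 2 * CHANNEL_WIDTH_BITS  # two channels per chip
    return data_pins * PIN_RATE_GBPS / 8

print(f"XSX, all 10 chips : {peak_bandwidth_gbs(10):.0f} GB/s")  # 560
print(f"XSX, 6-chip region: {peak_bandwidth_gbs(6):.0f} GB/s")   # 336
print(f"PS5, 8 chips      : {peak_bandwidth_gbs(8):.0f} GB/s")   # 448
```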
     
    jgp, DavidGraham, Gubbi and 5 others like this.
  8. zupallinere

    Regular Subscriber

    Joined:
    Sep 8, 2006
    Messages:
    750
    Likes Received:
    96
This would maybe be a small win for the PS5, since you could get a larger SSD to even things out. I doubt it's worth the cost early on, but it's an option if you want to be on the safe side.
     
  9. MrFox

    MrFox Deludedly Fantastic
    Legend Veteran

    Joined:
    Jan 7, 2012
    Messages:
    6,488
    Likes Received:
    5,995
Higher-efficiency and extremely compact PSUs go up in price very quickly, because they use some really expensive parts to get low switching losses and very high switching frequencies. So it looks like it's never been worth the effort to improve this on stationary devices; it just costs more for not much gain on a home console.

But I love the new stuff coming onto the market which could help make some crazy-small PSUs. The second generation of GaN FETs could make a 500 W PSU the size of a wall wart.
     
    Silent_Buddha, BRiT and zupallinere like this.
  10. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    10,869
    Likes Received:
    10,960
    Location:
    The North
    Thanks,

What would be a more basic measure then, half a cache line?
How do 4 KB, 1024 KB, and 1024 MB allocations get divided among the memory chips in a typical scenario? Are they spread over all the chips, or tossed into a single chip?
     
  11. zupallinere

    Regular Subscriber

    Joined:
    Sep 8, 2006
    Messages:
    750
    Likes Received:
    96
    Yeah Anker went big on that early on and it looks nice.
     
  12. Scott_Arm

    Legend

    Joined:
    Jun 16, 2004
    Messages:
    14,212
    Likes Received:
    5,651
@Metal_Spirit This is a console with a low-level API. Developers will be able to design their memory layout to mitigate any issues, unlike an older PC GPU driven through a high-level API that did not necessarily give explicit control over VRAM allocation.
     
  13. Scott_Arm

    Legend

    Joined:
    Jun 16, 2004
    Messages:
    14,212
    Likes Received:
    5,651
If your read is smaller than 1 cache line, you basically waste bandwidth. If a cache line is 64B and you read 32B, you waste 32B and get half the effective bandwidth. That's my understanding of it.

Edit: Well, I guess it's more complicated. If the next read you want is the next 32B, then you've already cached it and it's not wasted. But if the next 32B is irrelevant data, then you've wasted half the cache line and halved your bandwidth. The CPU may be different from the GPU.
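
A toy illustration of that effective-bandwidth point (Python; the 64-byte line size and the peak figure are just assumptions for the example):

```python
LINE_SIZE = 64  # bytes per cache line (assumed)

def effective_bandwidth(peak_gbs: float, useful_bytes_per_line: int) -> float:
    """Bandwidth spent on data you actually wanted, given how much of
    each fetched cache line gets used."""
    return peak_gbs * useful_bytes_per_line / LINE_SIZE

print(effective_bandwidth(560, 64))  # 560.0 - every byte of the line is used
print(effective_bandwidth(560, 32))  # 280.0 - half of every line is wasted
```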
     
    TheAlSpark, BRiT and iroboto like this.
  14. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    10,869
    Likes Received:
    10,960
    Location:
    The North
    That makes sense. Yea it's just too small of an amount.
    You'd have to read something larger.
     
  15. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,367
    Likes Received:
    3,961
    Location:
    Well within 3d
    Cache transactions are the size of their lines, so that's a good base unit.
    How pages are striped over channels varies. GPUs can stripe data so that each channel gets 128 bytes, or that was the case for some older APUs. The granularity for the ROPs, texture units, pixel quads, and rasterizer at the size of popular formats likely encouraged this. If the desire is to have as much parallel access to memory as possible, a pixel export that spits out hundreds of bytes isn't served by sending all that traffic into a single destination.
    CPU preferences go the other way, where there's not as much bandwidth, but the CPU wants to avoid costly latency penalties if it needs to change DRAM pages. At the same time, if there are multiple NUMA nodes, striping across nodes can either balance utilization or choke a high-bandwidth application.

    There's no single right answer, so it comes down to what the system wants to optimize for.
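
For intuition, here's what 128-byte striping across channels could look like (a minimal Python sketch under assumed parameters; the real address swizzling in AMD's memory controllers is undisclosed and certainly more involved):

```python
STRIPE_BYTES = 128   # assumed bytes per channel before moving to the next
NUM_CHANNELS = 20    # e.g. 10 chips x 2 channels on a 320-bit GDDR6 bus

def channel_for_address(addr: int) -> int:
    """Map a physical address to a channel with simple round-robin striping."""
    return (addr // STRIPE_BYTES) % NUM_CHANNELS

# A 4 KB linear read touches 32 stripes, spread across every channel...
touched = {channel_for_address(a) for a in range(0, 4096, STRIPE_BYTES)}
print(len(touched))  # 20 distinct channels -> lots of parallel traffic

# ...while two accesses that stay inside one stripe hit the same channel.
print(channel_for_address(0), channel_for_address(64))  # same channel twice
```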
     
  16. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    10,869
    Likes Received:
    10,960
    Location:
    The North
    You're a beast.

Okay, so what you're saying is that, at this point in time, there's no way to tell what MS did with their memory controller setup. So without benchmarking, we really don't know how it's going to perform or even behave.
     
    #1596 iroboto, Mar 31, 2020
    Last edited: Mar 31, 2020
    egoless, AzBat, Proelite and 2 others like this.
  17. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    10,869
    Likes Received:
    10,960
    Location:
    The North
There are multiple forum posts and videos going around the interwebs making these exact claims. I figured I would address the claim that PS5 has more bandwidth than XSX through averaging. The idea of a slow pool and a fast pool of memory is equally aggravating. It's the same damn clock rate and bus width per chip; the six 2 GB chips just take longer to fill up than the other four. There is no fast and slow, though.

    Sorry though, didn't mean to imply.
     
    #1597 iroboto, Mar 31, 2020
    Last edited: Mar 31, 2020
  18. RobertR1

    RobertR1 Pro
    Legend

    Joined:
    Nov 2, 2005
    Messages:
    5,747
    Likes Received:
    949
We're gonna need to be able to run PC synthetics on these consoles to get the shitpost value to skyrocket.
     
    egoless and iroboto like this.
  19. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,367
    Likes Received:
    3,961
    Location:
    Well within 3d
    I think it's possible that the GPU-optimized portion could stripe data differently than the standard portion of memory, since the GPU-optimized portion is supposed to give the GPU as much bandwidth as possible, and striping can give the GPU more opportunities to generate parallel traffic.
    NUMA considerations aren't really a concern in a single-chip system, so I think that won't be something they'll optimize towards.

    I think it'd be fine to expect the system to do well in utilizing bandwidth in the GPU-optimized memory, since the GPU is so important to the console's purpose. AMD has indicated it's improved how well its memory controllers balance CPU and GPU traffic, or at least I hope they have since 2013.
     
    BRiT, PSman1700, Proelite and 2 others like this.
  20. Rockster

    Regular

    Joined:
    Nov 5, 2003
    Messages:
    973
    Likes Received:
    129
    Location:
    On my rock
I'm honestly confused by the notion that there is a different performance expectation due to the asymmetrical chip densities. If it used all 2 GB chips, the arbitration of CPU and GPU access by the memory controller would be no different, other than managing it over the entire address space as opposed to a portion of it.
     
    VitaminB6 and PSman1700 like this.