Next Generation Hardware Speculation with a Technical Spin [post GDC 2020] [XBSX, PS5]

Discussion in 'Console Technology' started by Proelite, Mar 16, 2020.

  1. jlippo

    Veteran Regular

    Joined:
    Oct 7, 2004
    Messages:
    1,425
    Likes Received:
    536
    Location:
    Finland
    An SSD without proper streaming wouldn't. (Think loading the full 8K mip of a texture for an arrowhead sitting on a faraway shelf.)

    With proper virtual texturing and similar streaming methods on top of the SSD, it should be fine.
    The viewport size and the number of different texture layers more or less determine the size of the texture atlas needed, and its memory use is constant.

    After all, huge textures are rarely completely in view and there are always places where lower mips can be used.
    The biggest texture in Rage was 128k x 128k and it managed to run on X360/PS3, although it understandably had some serious texture popping due to drive performance.
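
    Back-of-the-envelope sketch of that point (my own made-up helper and numbers, not from any shipping engine): with virtual texturing the resident cache only needs roughly one texel per screen pixel per material layer, so its size scales with the viewport rather than with the source textures.

        # Sketch: resident texture cache needed for ~1:1 texel-to-pixel sampling.
        def resident_cache_mb(width, height, layers, bytes_per_texel=4, overhead=1.5):
            texels = width * height * layers          # ~1 texel per screen pixel per layer
            return texels * bytes_per_texel * overhead / (1024 ** 2)

        # 4K viewport with 4 material layers (albedo/normal/roughness/etc.)
        print(resident_cache_mb(3840, 2160, 4))       # ~190 MB, regardless of how big the textures are on disk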
     
    megre and chris1515 like this.
  2. chris1515

    Veteran Regular

    Joined:
    Jul 24, 2005
    Messages:
    4,522
    Likes Received:
    3,350
    Location:
    Barcelona Spain
    But each time you use the slower memory, you diminish the total bandwidth. As an example: if you spend half a second on the slow memory (336 GB/s) and half a second on the fast memory (560 GB/s), you end up with 448 GB/s of effective bandwidth.
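
    Quick check of that arithmetic as a time-weighted average (just a sketch; the real controller obviously interleaves far finer than half-second chunks):

        slow, fast = 336, 560          # GB/s for each pool
        t_slow, t_fast = 0.5, 0.5      # fraction of time spent in each
        print(slow * t_slow + fast * t_fast)   # 448.0 GB/s effective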

    I understand that OS functionality which is accessed often can be in the fast memory to get better bandwidth, and functionality rarely used can go in the slower memory.

    EDIT: Like you, I was saying "why use fast memory for the OS, this is stupid", but in the end the answer is logical.
     
    #1502 chris1515, Mar 31, 2020
    Last edited: Mar 31, 2020
    megre likes this.
  3. Silent_Buddha

    Legend

    Joined:
    Mar 13, 2007
    Messages:
    16,987
    Likes Received:
    6,236
    So, MS are lying when they say the full 10 GB of "fast" memory allocation is for games?

    I still see no valid reason why the OS at any point would require more than the bandwidth available for the "slow" memory allocation. I'd be surprised if at any point the OS would use more than 100 GB/s of memory bandwidth, much less 500+.

    The Ryzen 7 3700X and 3900X can't even read from memory that quickly (tens of GB/s, not hundreds). Does that mean that Windows PCs using those CPUs have slow OS response? Is the OS on XBSX going to be doing something that is impossible on PC? Again, we're talking about the OS here and not games. My expectation is that the OS will be doing significantly less than a PC OS. It's not like someone will be opening up a large Photoshop project on XBSX or doing massive database searches (in an Azure datacenter perhaps, but not on XBSX).

    Regards,
    SB
     
    #1503 Silent_Buddha, Mar 31, 2020
    Last edited: Mar 31, 2020
    Silenti, BRiT, Michellstar and 5 others like this.
  4. chris1515

    Veteran Regular

    Joined:
    Jul 24, 2005
    Messages:
    4,522
    Likes Received:
    3,350
    Location:
    Barcelona Spain
    It is a balance: if the OS is fully in the slower RAM, you lose more bandwidth.

    https://www.resetera.com/threads/pl...ve-ot-secret-agent-cerny.175780/post-30333499

    It is not about the number of GB/s, it is about how often you need to access the data. Maybe everything from the OS is in slow memory, but then it costs more bandwidth. No solution is perfect.
     
    egoless likes this.
  5. Lurkmass

    Newcomer

    Joined:
    Mar 3, 2020
    Messages:
    102
    Likes Received:
    89
    CUDA dominated because of failed politics inside the Khronos Group, the same people who standardized OpenCL, so don't hope for much since most vendors don't actually care about it.

    It's not like Mac users had any say when Apple deprecated support for Nvidia hardware, so their unsupported GPUs are effectively paperweights unless they use Boot Camp. It also depends on what exactly you mean by "ML libraries", since inference can be done on almost any hardware with the likes of TensorFlow Lite.

    If you're expecting to train models with full-featured TensorFlow or PyTorch, then no amount of coding on a Mac will get you anywhere, since it lacks the proper APIs to do this. TensorFlow's CUDA kernels use C++ templates, a feature that's not available in Apple's latest and greatest Metal API. There's even a ROCm port of TensorFlow that runs on AMD hardware, which uses their HIP API instead of OpenCL precisely to get around the lack of C++ features common to many PC APIs.
     
  6. Proelite

    Veteran Regular Subscriber

    Joined:
    Jul 3, 2006
    Messages:
    1,417
    Likes Received:
    743
    Location:
    Redmond
    https://www.eurogamer.net/articles/digitalfoundry-2020-inside-xbox-series-x-full-specs

     
  7. chris1515

    Veteran Regular

    Joined:
    Jul 24, 2005
    Messages:
    4,522
    Likes Received:
    3,350
    Location:
    Barcelona Spain
  8. BRiT

    BRiT Verified (╯°□°)╯
    Moderator Legend Alpha

    Joined:
    Feb 7, 2002
    Messages:
    14,887
    Likes Received:
    13,019
    Location:
    Cleveland
    Sheesh, people trying to make up shit for page hits don't even do basic homework to get things correct when they've already been clearly stated by Microsoft and Digital Foundry.

    We even had it listed in the system reservations thread with a direct link to the source. Now it's been quoted inline explicitly.
     
    RagnarokFF, egoless, tinokun and 7 others like this.
  9. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    9,908
    Likes Received:
    9,262
    Location:
    Self Imposed Work Exile: The North
    Yea, I get that. It's not something that's going to happen overnight. But someone is buying AMD hardware; over time, with a large enough population, ideally people may build more libraries for it.

    As for now: yes, I'm stuck on Nvidia if I want to stay with high-level libraries that have GPU support.
     
    pharma and PSman1700 like this.
  10. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    10,848
    Likes Received:
    5,423
    When you look at chip X-rays from older AMD APUs/GPUs, it looks like the PHYs they use are all 64-bit wide. Or maybe they're all 32-bit ones placed in pairs, side by side.
    GDDR6 could change that, though.

    Is 40ºC even a realistic proposition? Who's playing videogames at that room temperature?

    AVX2 (256-bit) only came with Haswell, I think.

    Ice Lake U is pretty mainstream IMO, though I don't even know if it's usable (the CPU-Z AVX-512 test crashes my Ice Lake laptop...).


    Aah, AMD's infamous Game Cache!

    I don't think Cerny would make a presentation addressed to game developers and claim the typical clocks are only reached when the console is running non-gaming apps.
    Besides, the CPU and GPU will most likely drop their clocks and voltage (and disable CUs, since the APU has that ability) like hell when watching Netflix.

    Assuming this multiplatform title is running the exact same content (shader complexity, asset size, etc.) on both.
    Which may not always be the case, especially with the large advantage in I/O performance on one side and the memory bandwidth advantage on the other.

    VRS next gen is likely to be as widespread as depending on SSDs to stream the games' assets.
    Nvidia and Intel graphics have had it for a while, and RDNA1 is the odd duck here. Most probably VRS is part of RDNA2, and Microsoft is implementing their tweaked, customized version that perhaps offers better performance and/or is more flexible than AMD's.

    It seems there's this idea that pushing the hardware to provide better visuals is inherently going to draw more power from the chip, which isn't true.

    Someone here already mentioned FurMark, and that's a great example, along with e.g. OCCT.
    Nowadays graphics card drivers will automatically limit the clocks if certain loads (like FurMark) are detected, yet even then the card consumes more power.
    For example, a 2080 Ti that reaches 1950 MHz in a Unity graphics benchmark is limited to 1500 MHz if it's running FurMark, and in both cases the driver is always trying to touch the TDP limit by boosting the clocks as much as possible. And even at the downclocked 1500 MHz, the card runs hotter than it would under regular gaming loads.

    It's not like FurMark looks good. The code is over 15 years old AFAIK. It's just the kind of code that hammers a certain part of the rendering pipeline, making the chip's power consumption and heat output reach very high levels.
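
    To put rough numbers on why "looks better" doesn't have to mean "draws more power", here's a toy dynamic-power model (P ~ activity x C x V^2 x f; the activity and capacitance figures below are completely made up, only the shape of the relation matters):

        def power_w(activity, c_eff_nf, volts, mhz):
            # very rough dynamic power: alpha * C * V^2 * f
            return activity * (c_eff_nf * 1e-9) * volts ** 2 * (mhz * 1e6)

        print(round(power_w(0.50, 250, 1.00, 1950)))  # game-like load at high clocks   -> ~244 W
        print(round(power_w(0.85, 250, 0.90, 1500)))  # FurMark-like load, downclocked  -> ~258 W

    The FurMark-style case still ends up hotter despite the lower clock, simply because it keeps far more of the chip toggling every cycle.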

    So you're saying the One X devkits with 4 extra CUs enabled are horrible?

    Was it ever for anything else?
     
    Mitchings likes this.
  11. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    9,908
    Likes Received:
    9,262
    Location:
    Self Imposed Work Exile: The North
    Yea, I get that. I guess I'm thinking about whether it's possible to have high enough texture resolution that the texture is never stretched, i.e. 1:1 with your native resolution at the closest camera distance, for all objects. Yes, you're only loading in portions of it with virtual texturing. And yes, I'm sure we can find a way to load things in and out effectively, but can we do that and still keep the benefits of the fast loading, the 'instant turn-around, load everything' with no pop-in, the super speed, etc.?

    It just seems like, if the only limitation on texture size is the footprint it leaves on the drive, I'm surprised this wasn't resolved a long time ago.

    I get that textures are a big part of graphics; the more detail they contain, the better everything looks. It's just the way it is.
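
    For what it's worth, a toy version of that 1:1 criterion (my own math, not any engine's): the mip you actually need is set by how many screen pixels the surface covers, so past a certain point the extra texels of a huge source texture are never sampled at that distance.

        import math

        def needed_mip(texture_size, pixels_covered):
            # coarsest mip that still gives at least ~1 texel per screen pixel
            return max(0, math.floor(math.log2(texture_size / pixels_covered)))

        # an 8192^2 texture on a surface spanning ~1000 pixels on screen
        print(needed_mip(8192, 1000))   # mip 3: the 1024^2 level already covers it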
     
    jlippo, PSman1700 and BRiT like this.
  12. RobertR1

    RobertR1 Pro
    Legend

    Joined:
    Nov 2, 2005
    Messages:
    5,725
    Likes Received:
    901
    I don't stay current on mobile archs, but yeah, it seems to.

    You can run Prime95 29.8 or the latest AIDA64 (FPU only) if you want to test AVX-512. CPU-Z is trash-tier as a stress test.
     
  13. Metal_Spirit

    Regular Newcomer

    Joined:
    Jan 3, 2007
    Messages:
    546
    Likes Received:
    336
    Not sure I get what you guys are talking about, but it seems to me the Xbox Series X does indeed have a bandwidth bottleneck if both pools of memory are accessed at the same time.

    The problem lies in the memory configuration. The Xbox has 10 memory chips: 4 with 1 GB and 6 with 2 GB. To get two pools, one with 10 GB at 560 GB/s (320 bits) and one with 6 GB at 336 GB/s (192 bits), the layout must be the 4x1 GB modules accessed at 32 bits each, plus the first 1 GB of each of the six 2 GB modules, also accessed at 32 bits.
    With 5x64-bit controllers that gives you 320 bits of access across all these chips, each providing 56 GB/s, so 10 chips at 1 GB equals 10 GB at 560 GB/s.
    Now for the other pool you need to access the extra 1 GB on the 2 GB modules. Since each is connected with a 32-bit bus, and there are 6 modules, that's a 192-bit bus... which equates to 6 GB at 336 GB/s.
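
    (Those per-pool figures do check out if you assume 14 Gbps GDDR6 on 32-bit channels; quick sanity check, spec numbers only:)

        gbps_per_pin = 14                     # GDDR6 data rate per pin
        per_chip = gbps_per_pin * 32 / 8      # 56 GB/s per chip on a 32-bit channel
        print(per_chip * 10)                  # 560.0 GB/s striping across all 10 chips
        print(per_chip * 6)                   # 336.0 GB/s using only the six 2 GB chips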

    The big problem is that you are counting the same 32-bit channel on the 2 GB modules towards both pools: that's fine for quoting the maximum bandwidth of each pool, but it doesn't work like that in reality, since it's the same bus for both. If you are using those 32 bits for one pool, you cannot be using the same 32-bit channel for the other.

    So to access both pools at the full 32 bits, the simple choice is to do it on alternate clock cycles. That's about the same as reducing the bus width to 16 bits for each pool and accessing both at the same time.

    Since the 1 GB modules are free from this, they will still provide 224 GB/s in total. But the 2 GB modules will provide half per pool, reducing the 10 GB pool's bandwidth to 392 GB/s and the 6 GB one's to 168 GB/s.

    I really don't know how this can be solved... Any ideas?
     
    #1513 Metal_Spirit, Mar 31, 2020
    Last edited: Mar 31, 2020
    Mitchings likes this.
  14. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    9,908
    Likes Received:
    9,262
    Location:
    Self Imposed Work Exile: The North
    This is the first I've ever heard of a 192-bit bus. You would draw lanes to both halves of the chip, but certainly not 16 bits to each. Why would you do that?
    Nothing here indicates that the speed of the memory changes at all.

    I thought this was straightforward:
    Same pool of memory, different chip sizes. Slow vs. fast is really more a question of whether you're pulling from 10 chips or 6.

    There are 10 chips in total on a 320-bit bus; each chip has 56 GB/s of bandwidth on its 32-bit channel.
    56 * 10 = 560 GB/s
    Bandwidth is the total size of the pipe, and in this case it's about the total amount you can pull at once.
    Of the 10 chips, 6 of them are 2 GB in size.
    56 * 6 = 336 GB/s

    If your data is on the 2nd half of the 2 GB chips, you will get 336 GB/s, because you only have 6 chips to pull that data from. I don't care how data is stored on the 2 GB chips, a chip will always be able to return 32 bits of data per clock cycle. Whether it's 32 bits to each GB, or 16 bits to both halves with the data split, whatever the case, it's returning 32 bits through the memory controller every single time.

    But you still have 4 bus openings available on the remaining 4 chips; just because it's accessing the back half of those 2 GB chips doesn't mean the other lanes are closed off.

    So you can still pull 56 * 4 from the remaining 1 GB chips,
    which is 224 GB/s,

    so adding these together, it is back to 560 GB/s.

    There is no averaging of memory
    There is no split pool.

    Your only downside is if you put _all_ of your data on the 6x2 GB chips; then you're limited to a bandwidth of 336 GB/s, because you'll grab the data on one half and, if you need data on the other half, you'll need to alternate. But that can be handled by priority, and it doesn't stop developers from fully utilizing all lanes to achieve the 560 GB/s.

    Regardless of whether you are alternating or not, those 6 chips will constantly be giving out 336 GB/s.
    And regardless of whether you are alternating on the 6x2 GB chips, you still have the 4x1 GB chips waiting to go, giving you a total of 560 GB/s of bandwidth whenever all 10 chips are being utilized.


    This should not be treated like 2 separate RAM pools, like the 360 or XBO had.

    Because of the imbalance in CPU vs. GPU bandwidth needs, perhaps you'll just prioritize the GPU.
    While I'm not sure how priority works (edit: they'll prioritize whatever is going to give the most performance), GPU data goes into the 1 GB chips first; that's an easy one. Remaining GPU data goes into the first 1 GB of the 2 GB chips. The CPU data would sit on top of that, along with any GPGPU work that may need to be done.

    TL;DR: I don't really see an issue here unless you've always got contention on those 6x2 GB chips, and you'd probably have that anyway. No one would be making this argument if it were 10x2 GB chips. You'd still have the same issue if all the data you're pulling through the memory controller sits on the same chips. It would still be 560 GB/s, you'd just have 20 GB of memory. Would you use your current argument to say that 10x2 GB chips are bottlenecked because the data is split over 2 GB chips and now it needs to alternate? Or that it needs some sort of custom controller which has to trade off bandwidth?
     
    #1514 iroboto, Mar 31, 2020
    Last edited: Mar 31, 2020
  15. tunafish

    Regular

    Joined:
    Aug 19, 2011
    Messages:
    575
    Likes Received:
    253
    No, in these kinds of systems you always prioritize the CPU first, because it gets hit much worse by the added latency of waiting, and because the GPU can trivially use all of the bandwidth pretty much constantly while the CPU cannot; if you prioritize the GPU, you can end up completely starving the CPU of resources.
     
    egoless, PSman1700, disco_ and 3 others like this.
  16. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    9,908
    Likes Received:
    9,262
    Location:
    Self Imposed Work Exile: The North
    yea perhaps. You might be right.
    But what if the CPU is sitting around idle most of the time anyway? Would you still prioritize it?

    Well clearly the prioritization is probably a lot more complex than we are making it lol
     
  17. tunafish

    Regular

    Joined:
    Aug 19, 2011
    Messages:
    575
    Likes Received:
    253
    If it's sitting idle most of the time, it's not consuming a lot of RAM bandwidth, and therefore having it prioritized doesn't hurt you much.
     
    jgp, disco_ and BRiT like this.
  18. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    9,908
    Likes Received:
    9,262
    Location:
    Self Imposed Work Exile: The North
    That's also true.
     
  19. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,348
    Likes Received:
    3,881
    Location:
    Well within 3d
    The Navi operations increase throughput over a conventional FMA by acting like packed math and then allowing the results to be combined into an accumulator. Looking at the patent or how the tensor operations work for Nvidia, it looks like it would be a fraction of what the matrix ops would do. The lane format without a matrix unit would allow those dot operations to generate only the results along a diagonal of that big matrix.
    The AMD scheme is more consistent with Vega, as the vector unit is 16-wide, and the new hardware may align with code referencing new instructions and registers for Arcturus. One other indication this is different is that the Navi dot instruction would take up a normal vector instruction slot since it happens in the same block. Arcturus and this matrix method would allow at least some normal vector traffic in parallel.
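
    To illustrate that "diagonal only" point with a toy numpy sketch (my framing, not anything from the patent): per-lane packed dot products give one accumulated scalar per lane, which is just the diagonal of the full product a dedicated matrix unit would produce.

        import numpy as np

        lanes, k = 16, 4                         # 16-wide vector unit, 4-element packed dot per lane
        A = np.random.rand(lanes, k).astype(np.float32)
        B = np.random.rand(lanes, k).astype(np.float32)

        acc = np.einsum('ij,ij->i', A, B)        # dot-with-accumulate: one result per lane
        full = A @ B.T                           # what a matrix engine would produce (lanes x lanes)
        assert np.allclose(acc, np.diag(full), atol=1e-5)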

    The scenario where the system is spending half a second in the slow pool requires something in the OS, an app, or a game resource put in the slow section needing 168 GB/s of bandwidth.
    There is some impact because of the imbalance, but it scales by what percentage of the memory access mixture goes to the slow portion. If a game did that, it would likely be considered a mis-allocation. A background app would likely be inactive or prevented from having anything like that access rate, and the OS gets by with a minority share of the tens of GB/s in normal PCs without issue.
    I can see the OS sporadically interacting with shared buffers for signalling purposes or copying data from secured memory to a place where the game can use it, but that's on the order of things like networking or the sub-10 GB/s disk I/O.

    If the GDDR6 chips were all the same capacity, there would still be a "pool" for the OS and apps, since accesses for them wouldn't be going to the game. The individual controllers would see some percentage of accesses going to them that the game wouldn't be able to use. Let's say 1% goes to the OS, or 5.6 GB/s. The game experiences a bandwidth bottleneck if it needs something like 555 GB/s in that given second. If there's a set of code, sound data, or rarely accessed textures that don't get used in the current game scene unless the user hits a specific button or action, finally hitting that action while the game is going on blocks the other functions' accesses for some number of cycles.
    With the non-symmetric layout, the OS or slow pool pushes some of that percentage onto the channels associated with the 2 GB chips.
    Going by the 1% scenario, the six controllers would need to find room for the 40% of the OS traffic that cannot be striped across the smaller chips, or 40% of 5.6 GB/s. The 336 GB/s pool would be burdened with an extra 2.24 GB/s.
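
    Spelled out (same assumed 1% OS share):

        total_bw = 560.0                    # GB/s striped across all ten chips
        os_bw = total_bw * 0.01             # 5.6 GB/s of OS traffic
        # the upper 6 GB exists only on the six 2 GB chips, which carry 6/10 of the
        # striped bandwidth; the other 4/10 of the OS traffic folds onto them too
        print(os_bw, os_bw * 0.4)           # 5.6 GB/s total, ~2.24 GB/s extra on the 336 GB/s pool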

    Unless something in the slow pool demands a significant fraction of the bandwidth, and I don't know what functionality other than the game's renderer needs bandwidth on that order of magnitude, I can see why Microsoft saw it as a worthwhile trade-off.
    If a game put one of its most heavily used buffers in the slow pool, I think the assumption is that the developers would find a way to move it out of there or scale it back so it fits elsewhere.
     
  20. mrcorbo

    mrcorbo Foo Fighter
    Veteran

    Joined:
    Dec 8, 2004
    Messages:
    3,898
    Likes Received:
    2,597
    There's no reason to think they are doing this. When you access an address in slow RAM (and the client is the CPU/IO), you do it at 192 bits. When you access an address in fast RAM (and the client is the GPU), it's done at 320 bits. There's no reason why you'd have to alternate cycles. Which pool you're accessing would be totally dictated by client (GPU, CPU, or I/O) demand.
     