General Next Generation Rumors and Discussions [Post GDC 2020]

Discussion in 'Console Industry' started by BRiT, Mar 18, 2020.

  1. vjPiedPiper

    Newcomer

    Joined:
    Nov 23, 2005
    Messages:
    136
    Likes Received:
    88
    Location:
    Melbourne Aus.
    This is not actually true.
    High-end devices use RDMA to move data from one device to another without any kernel/host interaction.
    (Or at least with very minimal host interaction: it's only needed to allocate the scatter-gather tables for each device, and even that can be made static.)

    It's not common, but it's possible.
    Given the different security environment of a console, I would not be surprised to see a similar mechanism used all over the place.
     
    pjbliverpool likes this.
  2. pTmdfx

    Regular

    Joined:
    May 27, 2014
    Messages:
    415
    Likes Received:
    379
    AMD calls them SDMA engines. SDMA shares with ACEs and GFX the user mode queue based infrastructure for multi-process & virtualisation support. These are present on all GCN APUs and GPUs as far as I am aware.

    If the DMA controller is bespoke, it could tap into the hardware direct P2P doorbell mechanism of AMD GPUs (and HSA), which operates in user mode and requires no CPU involvement (other than setting it up).

    Though my expectation for the disk-access control flow remains that initiation must involve the host OS, because the storage is managed by a filesystem with ACLs. So perhaps the only opportunity is that the DMA destination can be a GPU-friendly allocation, and completion of the DMA can “ring” the GPU queues directly.
     
    #862 pTmdfx, Apr 8, 2020
    Last edited: Apr 8, 2020
    pjbliverpool likes this.
  3. DSoup

    DSoup Series Soup
    Legend Subscriber

    Joined:
    Nov 23, 2007
    Messages:
    16,775
    Likes Received:
    12,690
    Location:
    London, UK
    Outside of Windows Server running on Hyper-V, isn't RDMA support in Windows restricted to certain network adaptors? I can't see it helping here. There are two layers of barrier. The first is software: the data exists on an SSD operating under a device driver responsible for the appropriate interface (SATA, NVMe, etc.) and needs to get to GDDR on the graphics card, which operates under a device driver supplied by AMD, Nvidia or Intel. The second barrier is hardware: these two physical pieces of hardware are separated by two buses.

    There is a good reason why AMD sold a high-end Radeon Pro with an actual SSD mounted on the card itself. They weren't insane. :nope:
     
    pharma and pjbliverpool like this.
  4. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    9,235
    Likes Received:
    4,259
    Location:
    Guess...
    It seems to me that this is all within Microsoft's gift to solve via something like DirectStorage, although they may need to work in concert with SSD vendors to produce "DirectStorage Certified" SSDs with the appropriate RDMA hardware and speed characteristics. Obviously this is more a hope than a prediction. But given Microsoft's wish to unify development between the PC and Xbox, and their focus on the Velocity Architecture in the XSX, in particular its ability to directly access 100GB of SSD memory from the GPU, it would make sense for them to try to enable that feature for the PC market. Both AMD (via SSG) and Nvidia (via GPUDirect Storage) seem geared up on the hardware front and so would presumably welcome this being opened up to the gaming market.

    On Zen 2 at least, I assume the same on-die PCIe controller manages both the NVMe and GPU buses. Could this not essentially be treated as one single bus from SSD to GPU if managed by an appropriate storage API and drivers? Pure speculation, of course.
     
    #864 pjbliverpool, Apr 8, 2020
    Last edited: Apr 8, 2020
    pharma and PSman1700 like this.
  5. Globalisateur

    Globalisateur Globby
    Veteran Subscriber

    Joined:
    Nov 6, 2013
    Messages:
    4,592
    Likes Received:
    3,411
    Location:
    France
    Regarding the Velocity Architecture in particular: they have the slow RAM dedicated to the CPU. Won't the textures be loaded into the CPU RAM first, since the CPU manages the procedure (as they announced), and then have to be copied to the GPU RAM afterwards? Won't they have to do that for any assets loaded from the SSD?
     
    egoless likes this.
  6. chris1515

    Legend

    Joined:
    Jul 24, 2005
    Messages:
    7,157
    Likes Received:
    7,965
    Location:
    Barcelona Spain
    The CPU has access to the fast RAM. The assets will be written directly there.
     
    tinokun and BRiT like this.
  7. Globalisateur

    Globalisateur Globby
    Veteran Subscriber

    Joined:
    Nov 6, 2013
    Messages:
    4,592
    Likes Received:
    3,411
    Location:
    France
    How is that possible? They said the CPU will use only the slow RAM, and the fast RAM will be reserved for the GPU.

    About the direct Storage:
    https://news.xbox.com/en-us/2020/03/16/xbox-series-x-glossary/
     
  8. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    44,104
    Likes Received:
    16,896
    Location:
    Under my bridge
    Nope. They said the fast RAM appears to the CPU and other non-GPU processors as slow RAM at 336 GB/s:

    "Memory performance is asymmetrical - it's not something we could have done with the PC," explains Andrew Goossen. "10 gigabytes of physical memory [runs at] 560GB/s. We call this GPU optimal memory. Six gigabytes [runs at] 336GB/s. We call this standard memory. GPU optimal and standard offer identical performance for CPU, audio and file IO. The only hardware component that sees a difference is the GPU."

    If the fast RAM wasn't addressable by the CPU, it wouldn't offer identical performance to the slow RAM for CPU, audio and file IO. Likewise, the GPU can access both pools, but only the 10 GB portion at full speed.
     
    PSman1700, AzBat, BRiT and 2 others like this.
  9. Betanumerical

    Veteran

    Joined:
    Aug 20, 2007
    Messages:
    1,763
    Likes Received:
    280
    Location:
    In the land of the drop bears
    I read that as the slow pool not lowering performance for CPU, audio and file IO, not as the CPU accessing the fast pool at the full 560GB/s.
     
  10. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    44,104
    Likes Received:
    16,896
    Location:
    Under my bridge
    That would be "CPU, audio and file IO access the standard memory at the same 336 GB/s". You wouldn't mention accessing the 'GPU optimal' memory at all if it wasn't possible. It also states that performance is asymmetrical, not capacity.
     
  11. DSoup

    DSoup Series Soup
    Legend Subscriber

    Joined:
    Nov 23, 2007
    Messages:
    16,775
    Likes Received:
    12,690
    Location:
    London, UK
    It is within Microsoft's ability to remove this software barrier but they'll need to engineer a new kernel/driver model for Windows 10 and it could spell the return of drive-specific drivers for storage. Fundamentally you have to do the things no operating system wants to do: allow one kernel driver to directly talk to another outside of kernel communication ports. The only way you ever want to do this is with very specific signed-drivers and the Windows kernel implicitly trusting both. :runaway:

    This could be why there is so little specific information on DirectStorage. Forget RDMA; in this regard it is a complete red herring. RDMA is a communications protocol for Ethernet and other wired networks. It's not a microbus implementation.

    One controller, but still multiple buses that are not connected. The human analogy is a real bus or train: sometimes you need to do a journey in multiple legs, and that means getting on a bus, going somewhere, getting off, usually waiting around a bit, then getting on another bus to get where you want to go. One bus may be very fast, the other much slower. You don't want to do this with data if you can avoid it. :nope: Have you seen the sketchy types who use buses at night? :runaway:
     
    #871 DSoup, Apr 8, 2020
    Last edited: Apr 8, 2020
    pjbliverpool, TheAlSpark and BRiT like this.
  12. Betanumerical

    Veteran

    Joined:
    Aug 20, 2007
    Messages:
    1,763
    Likes Received:
    280
    Location:
    In the land of the drop bears
    I reread the paragraph and it seems you are correct. It's interesting that they limited the interface for the CPU, because it doesn't make much sense to me.
     
  13. iroboto

    iroboto Daft Funk
    Legend Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    14,833
    Likes Received:
    18,632
    Location:
    The North
    Likely because that amount of bandwidth is already overkill for the CPU.
    I guess the question is, why not just give it all? What would change if you restricted it vs not restricting it?
     
    BRiT likes this.
  14. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    44,104
    Likes Received:
    16,896
    Location:
    Under my bridge
    It's a cost-reduction measure. 16 GB @ 560 GB/s (or whatever configuration they'd need) is too expensive, so they created a 560 GB/s 'partition'. But at the same time they didn't want split pools or to limit capacity, so they have this 'best of both' solution. If a game needs more than 3.5 GB of CPU and audio data, it can dip into the other 10 GB pool.
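    A quick worked sketch of how such a 'partition' can fall out of the chip layout. The specific configuration below (ten 32-bit GDDR6 chips at 14 Gbps, six of 2 GB and four of 1 GB) is my assumption for illustration, not something confirmed in this thread; it just happens to land on the quoted figures.

```python
# Hedged sketch: one plausible GDDR6 layout matching the quoted numbers.
# Assumptions (not confirmed here): ten 32-bit GDDR6 chips at 14 Gbps,
# six of them 2 GB and four of them 1 GB.

GBPS_PER_PIN = 14          # GDDR6 data rate, gigabits per second per pin
BITS_PER_CHIP = 32         # channel width per chip
CHIPS_TOTAL = 10           # all ten chips striped together
CHIPS_WITH_EXTRA_GB = 6    # only the 2 GB chips back the "standard" pool

def bandwidth_gbs(chips):
    """Aggregate bandwidth in GB/s when striping across `chips` chips."""
    return chips * BITS_PER_CHIP * GBPS_PER_PIN / 8

fast_pool = bandwidth_gbs(CHIPS_TOTAL)          # first 1 GB of every chip
slow_pool = bandwidth_gbs(CHIPS_WITH_EXTRA_GB)  # upper 1 GB of the 2 GB chips

print(fast_pool)  # 560.0 GB/s "GPU optimal" pool (10 GB)
print(slow_pool)  # 336.0 GB/s "standard" pool (6 GB)
```

    The point being that the 'slow' pool isn't slower silicon; it's the same chips, just striped across fewer of them.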
     
  15. Betanumerical

    Veteran

    Joined:
    Aug 20, 2007
    Messages:
    1,763
    Likes Received:
    280
    Location:
    In the land of the drop bears
    I get that; what I don't understand is why the CPU can only access the fast pool at 336GB/s. It seems like an unnecessary limitation; surely the cost of allowing the CPU to access the fast pool at 560GB/s would be minimal.
     
  16. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    44,104
    Likes Received:
    16,896
    Location:
    Under my bridge
    It'll be to do with the bus and controller and stuff. Someone clever will be able to explain it properly. ;) But in short, having one low-speed access route to all the RAM, and one high-speed access route to a subset of the chips, gave a good cost compromise. You can be sure that the cost of allowing everything to access all the RAM at full speed wasn't minimal, otherwise it would have been done. This added complexity isn't advantageous to the system over unified faster access, so the only reason for it is economics, where the saving must be considerable enough to be worth the hassle.
     
    PSman1700 and VitaminB6 like this.
  17. iroboto

    iroboto Daft Funk
    Legend Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    14,833
    Likes Received:
    18,632
    Location:
    The North
    So do you think they have one set of controllers that can only ever access six chips at a time, and another set that can access four chips? So the GPU calls both sets of controllers and the CPU only calls one?

    Or is there some other setup that makes more sense?
     
  18. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    I mentioned a Linux kernel change that confirmed a more general level of support in Zen hardware for peer-to-peer DMA transfers. That could make it more plausible that AMD's silicon can enable more direct accesses, subject to everything else in the chain being able to support the functionality.
    An article that mentions this from last year: https://www.phoronix.com/scan.php?page=news_item&px=Linux-5.2-AMD-Zen-P2P-DMA

    I think one constraint is how much bandwidth the CPU portion can absorb, since a Zen CCX is able to transfer 32 bytes in each direction per clock, which we could assume runs at the 3.8 or 3.6 GHz rate. That is then further limited by the roughly half-speed fabric, although the recent APUs doubled the fabric's width.
    Since the DRAM bus only runs in one direction at a time, viewing CPU consumption in terms of 32 bytes per cycle gives ~122 GB/s at 3.8GHz. Double that if the fabric happens to be broad enough to permit the same throughput to the other CCX, and assume the fabric is fast enough not to be the bottleneck. I'm not sure from the examples we have right now what the console fabric would look like, as mobile APUs hit an earlier bottleneck due to far lower-bandwidth memory.
    Unless you redesign the Zen CCX, the CPU portion of the system has a maximum bandwidth below that of the slow section of RAM.
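    The arithmetic above, made explicit (my numbers, following the post's stated assumptions of 32 bytes per clock per direction and a 3.8 GHz clock):

```python
# Sanity check of the CCX-bandwidth argument: 32 bytes per clock in one
# direction, at an assumed 3.8 GHz core clock.

BYTES_PER_CLOCK = 32
CLOCK_HZ = 3.8e9

per_ccx = BYTES_PER_CLOCK * CLOCK_HZ / 1e9   # GB/s, one direction, one CCX
print(per_ccx)        # 121.6 GB/s per CCX, the ~122 GB/s figure above
print(2 * per_ccx)    # 243.2 GB/s if the fabric could feed both CCXs fully

# Even the doubled figure sits below the 336 GB/s of the slow RAM section,
# which is the post's point: the CPU side can't saturate it anyway.
```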
     
  19. mrcorbo

    mrcorbo Foo Fighter
    Veteran

    Joined:
    Dec 8, 2004
    Messages:
    4,024
    Likes Received:
    2,851
    @Fafalada said (on ResetEra) that it might be that the CPU is more sensitive to variable RAM bandwidth.

    And I wonder if it's a Lockhart compatibility thing. If Lockhart were to have 12GB of RAM in a 6×2GB configuration on a 192-bit interface, then this would allow some consistency between the two devices on the CPU side, letting developers focus mainly (or solely) on the GPU differences when developing for the two SKUs.
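    A hedged bit of arithmetic for that Lockhart speculation (the 12GB / 6×2GB configuration is this post's guess, not a confirmed spec): six 32-bit GDDR6 chips make a 192-bit bus, and at the same assumed 14 Gbps per pin that lands exactly on the Series X 'standard' pool figure, which is what would give the CPU-side consistency across SKUs.

```python
# Speculative Lockhart layout: six 32-bit GDDR6 chips at 14 Gbps.
chips = 6
bits_per_chip = 32
gbps_per_pin = 14

bus_width = chips * bits_per_chip            # total interface width in bits
bandwidth = bus_width * gbps_per_pin / 8     # aggregate bandwidth in GB/s

print(bus_width)   # 192 -- the 192-bit interface
print(bandwidth)   # 336.0 -- matches the Series X slow-pool figure
```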
     
    Rangers, DSoup and AzBat like this.
  20. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    44,104
    Likes Received:
    16,896
    Location:
    Under my bridge
    I don't know. I've never understood RAM buses and chips. :oops: The one thing we can be sure of is that if it were better to go single pool, single bus, MS would have done just that. ;)
     