General Next Generation Rumors and Discussions [Post GDC 2020]

Discussion in 'Console Industry' started by BRiT, Mar 18, 2020.

  1. vjPiedPiper

    Newcomer

    Joined:
    Nov 23, 2005
    Messages:
    136
    Likes Received:
    88
    Location:
    Melbourne Aus.
    This is not actually true.
    High-end devices use RDMA to move data from one device to another without any kernel/host interaction.
    (Or at least with very minimal host interaction: it's only needed to allocate the scatter-gather tables for each device, and even that can be made static.)

    It's not common, but it's possible.
    Given the different security environment of a console, I would not be surprised to see a similar mechanism used all over the place.
     
    pjbliverpool likes this.
  2. pTmdfx

    Regular

    Joined:
    May 27, 2014
    Messages:
    415
    Likes Received:
    379
    AMD calls them SDMA engines. SDMA shares with ACEs and GFX the user mode queue based infrastructure for multi-process & virtualisation support. These are present on all GCN APUs and GPUs as far as I am aware.

    If the DMA controller is bespoke, it could tap into the hardware direct P2P doorbell mechanism of AMD GPUs (and HSA), which operates in user mode and requires no CPU involvement (other than setting it up).

    Though my expectation for the disk-access control flow remains that initiation must involve the host OS, because the storage is managed by a filesystem with ACLs. So perhaps the only opportunity is that the DMA destination can be a GPU-friendly allocation, and completion of the DMA can “ring” the GPU queues directly.
     
    #862 pTmdfx, Apr 8, 2020
    Last edited: Apr 8, 2020
    pjbliverpool likes this.
  3. DSoup

    DSoup Series Soup
    Legend Subscriber

    Joined:
    Nov 23, 2007
    Messages:
    16,775
    Likes Received:
    12,690
    Location:
    London, UK
    Outside of Windows Server running on Hyper-V, isn't RDMA support in Windows restricted to certain network adaptors? I can't see it helping here. There are two layers of barrier. The first is software: the data exists on an SSD operating under a device driver responsible for the appropriate interface (SATA, NVMe, etc.) and needs to get to GDDR on the graphics card, which operates under a device driver supplied by AMD, Nvidia or Intel. The second barrier is hardware: these two physical pieces of hardware are separated by two buses.

    There is a good reason why AMD sold a high-end Radeon Pro with an actual SSD mounted on the card itself. They weren't insane. :nope:
     
    pharma and pjbliverpool like this.
  4. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    9,235
    Likes Received:
    4,259
    Location:
    Guess...
    It seems to me that this is all within Microsoft's gift to solve via something like DirectStorage, although they may need to work in concert with SSD vendors to produce "DirectStorage Certified" SSDs with the appropriate RDMA hardware and speed characteristics. Obviously this is more a hope than a prediction. But given Microsoft's wish to unify development between the PC and Xbox, and their focus on the Velocity Architecture in the XSX, in particular its ability to directly access 100GB of SSD memory from the GPU, it would make sense for them to try to enable that feature for the PC market. Both AMD (via SSG) and Nvidia (via GPUDirect Storage) seem geared up on the hardware front and so would presumably welcome this being opened up to the gaming market.

    On Zen 2 at least, I assume the same on-die PCIe controller manages both the NVMe and GPU buses. Could this not essentially be treated as one single bus from SSD to GPU if managed by an appropriate storage API and drivers? Pure speculation, of course.
     
    #864 pjbliverpool, Apr 8, 2020
    Last edited: Apr 8, 2020
    pharma and PSman1700 like this.
  5. Globalisateur

    Globalisateur Globby
    Veteran Subscriber

    Joined:
    Nov 6, 2013
    Messages:
    4,592
    Likes Received:
    3,411
    Location:
    France
    Regarding the Velocity Architecture in particular: they have the slow RAM dedicated to the CPU. Won't the textures be loaded into the CPU RAM first, since the CPU manages the procedure (as they announced), and then have to be copied to the GPU RAM afterwards? Won't they have to do that for any assets loaded from the SSD?
     
    egoless likes this.
  6. chris1515

    Legend

    Joined:
    Jul 24, 2005
    Messages:
    7,157
    Likes Received:
    7,965
    Location:
    Barcelona Spain
    The CPU has access to the fast RAM. The assets will be written directly there.
     
    tinokun and BRiT like this.
  7. Globalisateur

    Globalisateur Globby
    Veteran Subscriber

    Joined:
    Nov 6, 2013
    Messages:
    4,592
    Likes Received:
    3,411
    Location:
    France
    How is that possible? They said the CPU will use only the slow RAM, and the fast RAM will be reserved for the GPU.

    About the direct Storage:
    https://news.xbox.com/en-us/2020/03/16/xbox-series-x-glossary/
     
  8. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    44,104
    Likes Received:
    16,896
    Location:
    Under my bridge
    Nope. They said the fast RAM appears to the CPU and other non-GPU processors as slow RAM at 336 GB/s:

    "Memory performance is asymmetrical - it's not something we could have done with the PC," explains Andrew Goossen. "10 gigabytes of physical memory [runs at] 560GB/s. We call this GPU optimal memory. Six gigabytes [runs at] 336GB/s. We call this standard memory. GPU optimal and standard offer identical performance for CPU, audio and file IO. The only hardware component that sees a difference is the GPU."

    If the fast RAM wasn't addressable by the CPU, it wouldn't offer identical performance to the slow RAM for CPU, audio and file IO. Likewise, the GPU can access both pools, but only the 10 GB portion at full speed.
     
    PSman1700, AzBat, BRiT and 2 others like this.
  9. Betanumerical

    Veteran

    Joined:
    Aug 20, 2007
    Messages:
    1,763
    Likes Received:
    280
    Location:
    In the land of the drop bears
    I read that as the slow pool not lowering performance for CPU, audio and file IO, not as the CPU accessing the fast pool at the full 560GB/s.
     
  10. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    44,104
    Likes Received:
    16,896
    Location:
    Under my bridge
    That would be "CPU, audio and file IO access the standard memory at the same 336 GB/s". You wouldn't mention accessing the 'GPU optimal' memory at all if it wasn't possible. It also states that performance is asymmetrical, not capacity.
     
  11. DSoup

    DSoup Series Soup
    Legend Subscriber

    Joined:
    Nov 23, 2007
    Messages:
    16,775
    Likes Received:
    12,690
    Location:
    London, UK
    It is within Microsoft's ability to remove this software barrier but they'll need to engineer a new kernel/driver model for Windows 10 and it could spell the return of drive-specific drivers for storage. Fundamentally you have to do the things no operating system wants to do: allow one kernel driver to directly talk to another outside of kernel communication ports. The only way you ever want to do this is with very specific signed-drivers and the Windows kernel implicitly trusting both. :runaway:

    This could be why there is so little specific information on DirectStorage. Forget RDMA; in this regard it is a complete red herring. RDMA is a communications protocol for Ethernet and other wired networks. It's not a microbus implementation.

    One controller, but still multiple buses that are not connected. The human analogy is a real bus or train: sometimes you need to do a journey in multiple legs, and that means getting on a bus, going somewhere, getting off, usually waiting around a bit, then getting on another bus to get where you want to go. One bus may be very fast, the other much slower. You don't want to do this with data if you can avoid it. :nope: Have you seen the sketchy types who use buses at night? :runaway:
     
    #871 DSoup, Apr 8, 2020
    Last edited: Apr 8, 2020
    pjbliverpool, TheAlSpark and BRiT like this.
  12. Betanumerical

    Veteran

    Joined:
    Aug 20, 2007
    Messages:
    1,763
    Likes Received:
    280
    Location:
    In the land of the drop bears
    I reread the paragraph and it seems you are correct. It's interesting that they limited the interface for the CPU, because it doesn't make much sense to me.
     
  13. iroboto

    iroboto Daft Funk
    Legend Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    14,833
    Likes Received:
    18,632
    Location:
    The North
    Likely because that amount of bandwidth is already overkill for the CPU.
    I guess the question is, why not just give it all? What would change if you restricted it vs not restricting it?
     
    BRiT likes this.
  14. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    44,104
    Likes Received:
    16,896
    Location:
    Under my bridge
    It's a cost-reduction measure. 16 GB @ 560 GB/s (or whatever configuration they'd need) is too expensive, so they created a 560 GB/s 'partition'. But at the same time they didn't want split pools or to limit capacity, so they have this 'best of both' solution. If a game needs more than 3.5 GB of CPU and audio data, it can dip into the other 10 GB pool.
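    A quick worked sketch of how such a 'partition' can fall out of the chip layout. The specific configuration below (ten 32-bit GDDR6 chips at 14 Gbps, six of 2 GB and four of 1 GB) is my assumption for illustration, not something confirmed in this thread; it just happens to land on the quoted figures.

```python
# Hedged sketch: one plausible GDDR6 layout matching the quoted numbers.
# Assumptions (not confirmed here): ten 32-bit GDDR6 chips at 14 Gbps,
# six of them 2 GB and four of them 1 GB.

GBPS_PER_PIN = 14          # GDDR6 data rate, gigabits per second per pin
BITS_PER_CHIP = 32         # channel width per chip
CHIPS_TOTAL = 10           # all ten chips striped together
CHIPS_WITH_EXTRA_GB = 6    # only the 2 GB chips back the "standard" pool

def bandwidth_gbs(chips):
    """Aggregate bandwidth in GB/s when striping across `chips` chips."""
    return chips * BITS_PER_CHIP * GBPS_PER_PIN / 8

fast_pool = bandwidth_gbs(CHIPS_TOTAL)          # first 1 GB of every chip
slow_pool = bandwidth_gbs(CHIPS_WITH_EXTRA_GB)  # upper 1 GB of the 2 GB chips

print(fast_pool)  # 560.0 GB/s "GPU optimal" pool (10 GB)
print(slow_pool)  # 336.0 GB/s "standard" pool (6 GB)
```

    The point being that the 'slow' pool isn't slower silicon; it's the same chips, just striped across fewer of them.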
     
  15. Betanumerical

    Veteran

    Joined:
    Aug 20, 2007
    Messages:
    1,763
    Likes Received:
    280
    Location:
    In the land of the drop bears
    I get that; what I don't understand is why the CPU can only access the fast pool at 336GB/s. It seems like an unnecessary limitation; surely the cost of allowing the CPU to access the fast pool at 560GB/s would be minimal.
     
  16. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    44,104
    Likes Received:
    16,896
    Location:
    Under my bridge
    It'll be to do with the bus and controller and stuff. Someone clever will be able to explain it properly. ;) But in short, having one low-speed access route to all the RAM, and one high-speed access route to a subset of the chips, gave a good cost compromise. You can be sure that the cost of allowing everything to access all the RAM at full speed wasn't minimal, otherwise it would have been done. This added complexity isn't advantageous to the system over unified faster access, so the only reason for it is economics, where the saving must be considerable enough to be worth the hassle.
     
    PSman1700 and VitaminB6 like this.
  17. iroboto

    iroboto Daft Funk
    Legend Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    14,833
    Likes Received:
    18,632
    Location:
    The North
    So do you think they have one set of controllers that can only ever access six chips at a time, and another set that can access four chips? So the GPU calls both sets of controllers and the CPU only calls one?

    Or is there some other setup that makes more sense?
     
  18. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    I mentioned a Linux kernel change that confirmed a more general level of support in Zen hardware for peer-to-peer DMA transfers. That could make it more plausible that AMD's silicon can enable more direct accesses, subject to everything else in the chain being able to support the functionality.
    An article that mentions this from last year: https://www.phoronix.com/scan.php?page=news_item&px=Linux-5.2-AMD-Zen-P2P-DMA

    I think one constraint is how much bandwidth the CPU portion can absorb, since a Zen CCX is able to transfer 32 bytes in each direction per clock, which we could assume runs at the 3.8 or 3.6 GHz rate. That is then further limited by the roughly half-speed fabric, although the recent APUs doubled the fabric's width.
    Since the DRAM bus only runs in one direction at a time, viewing CPU consumption in terms of 32 bytes per cycle gives ~122 GB/s at 3.8GHz. Double that if the fabric happens to be broad enough to permit the same throughput to the other CCX, and assume the fabric is fast enough not to be the bottleneck. I'm not sure from the examples we have right now what the console fabric would look like, as mobile APUs hit an earlier bottleneck due to far lower-bandwidth memory.
    Unless you redesign the Zen CCX, the CPU portion of the system has a maximum bandwidth below that of the slow section of RAM.
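    The arithmetic above, made explicit (my numbers, following the post's stated assumptions of 32 bytes per clock per direction and a 3.8 GHz clock):

```python
# Sanity check of the CCX-bandwidth argument: 32 bytes per clock in one
# direction, at an assumed 3.8 GHz core clock.

BYTES_PER_CLOCK = 32
CLOCK_HZ = 3.8e9

per_ccx = BYTES_PER_CLOCK * CLOCK_HZ / 1e9   # GB/s, one direction, one CCX
print(per_ccx)        # 121.6 GB/s per CCX, the ~122 GB/s figure above
print(2 * per_ccx)    # 243.2 GB/s if the fabric could feed both CCXs fully

# Even the doubled figure sits below the 336 GB/s of the slow RAM section,
# which is the post's point: the CPU side can't saturate it anyway.
```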
     
  19. mrcorbo

    mrcorbo Foo Fighter
    Veteran

    Joined:
    Dec 8, 2004
    Messages:
    4,024
    Likes Received:
    2,851
    @Fafalada said (on ResetEra) that it might be that the CPU is more sensitive to variable RAM bandwidth.

    And I wonder if it's a Lockhart compatibility thing. If Lockhart were to have 12GB of RAM in a 6×2GB configuration on a 192-bit interface, then this would allow some consistency between the two devices on the CPU side, letting developers focus mainly (or solely) on the GPU differences when developing for the two SKUs.
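    A hedged bit of arithmetic for that Lockhart speculation (the 12GB / 6×2GB configuration is this post's guess, not a confirmed spec): six 32-bit GDDR6 chips make a 192-bit bus, and at the same assumed 14 Gbps per pin that lands exactly on the Series X 'standard' pool figure, which is what would give the CPU-side consistency across SKUs.

```python
# Speculative Lockhart layout: six 32-bit GDDR6 chips at 14 Gbps.
chips = 6
bits_per_chip = 32
gbps_per_pin = 14

bus_width = chips * bits_per_chip            # total interface width in bits
bandwidth = bus_width * gbps_per_pin / 8     # aggregate bandwidth in GB/s

print(bus_width)   # 192 -- the 192-bit interface
print(bandwidth)   # 336.0 -- matches the Series X slow-pool figure
```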
     
    Rangers, DSoup and AzBat like this.
  20. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    44,104
    Likes Received:
    16,896
    Location:
    Under my bridge
    I don't know. I've never understood RAM buses and chips. :oops: The one thing we can be sure of is that if it were better to go single pool, single bus, MS would have done just that. ;)
     