Next-Generation NVMe SSD and I/O Technology [PC, PS5, XBSX|S]

Shifty Geezer · Mar 27, 2022

Remij said:
Yes, there is a reason... it's the most efficient way to do it, considering how cost effective consoles have to be. On PC there are other ways of mitigating those differences.. by actually programming games for the PC architecture's strengths... wide buses, and high capacities. PCs will never have console level efficiencies in design.. we know that.

That's actually a notable problem IMO. Basically you have to spend more money, more silicon, to achieve the same sort of thing. Gaming on PCs tended to be having a powerful computer for home computing, and then play games on it. But now processing has eclipsed the needs of the home computing environment, a lot of the push for bigger, faster, better PCs is just for gaming. At which point, trying to build on such archaic legacy ideas is...really wasteful!

The consoles represent a next-gen IBM architecture in reality. They are flexible computers that could, with the right software, do video editing, gaming, browsing and Office, but more efficiently. Without being tied to legacy software, the new microarchs could give a breath of fresh air to the computing space. With the work having been done, it'd be nice to be able to roll these designs back into the Windows space so it too can use super efficient file access, etc. What we have instead is an argument that PC can just approach all problems like the Industrial Revolution, with just bigger, faster, more, eating more electricity, and having only clumsy, brute force solutions.

pjbliverpool · Mar 27, 2022

DSoup said:
Nvidia's description seems to describe a different model:

GPUDirect Storage enables a direct data path between local or remote storage, such as NVMe or NVMe over Fabric (NVMe-oF), and GPU memory. It avoids extra copies through a bounce buffer in the CPU’s memory, enabling a direct memory access (DMA) engine near the NIC or storage to move data on a direct path into or out of GPU memory — all without burdening the CPU.

?? That's describing exactly the same thing I did. Where are you seeing a difference? I described both the lack of need to copy data to main memory (It avoids extra copies through a bounce buffer in the CPU’s memory) and the use of the NVMe drives own DMA engine to handle the data copy (enabling a direct memory access (DMA) engine near the NIC or storage to move data).

I think you're getting hung up on the use of the term "direct path". This isn't referring to a physical bit of routing that runs directly between the NVMe and SDD that completely bypasses the CPU, but rather a direct path between the two that doesn't have to be copied in and out of system memory first. It obviously still has to go via the PCIe root complex though and the CPU/OS/application still needs to tell the DMA engine what and when to copy.

Since around 2011, AMD and Intel began incorporating south-bridge and north-bridges controllers on the main CPU die itself. The bus controllers very much still exist and their features still advance filling the need to support various new I/O models. The distinct logic blocks and interconnects continue still exist on-die, i.e. there are still but controllers for the different devices that can be connected. The integration was why CPU pin-counts exploded because the motherboard suddenly had a lot of more signals crowding into one chip.

The North Bridge is/was specifically a separate chip on the motherboard that contained functions such as the memory controller, PCIe graphics interface and iGPU. Those functions have now been integrated in to the CPU itself but the North Bridge as a separate chip has gone the ay of the dinosaur. This is relevant to what you were saying earlier because you were implying that the PC (as opposed to consoles) had to go through some additional burdensome process of transferring the data through the Northbridge which is simply not correct. Also, the South bridge still exists as a separate chip, it has not been integrated into the CPU, except arguable the NVMe interface which is why the South Bridge shouldn't factor into this conversation.

pjbliverpool · Mar 27, 2022

Shifty Geezer said:
At which point, trying to build on such archaic legacy ideas is...really wasteful!

The consoles represent a next-gen IBM architecture in reality. They are flexible computers that could, with the right software, do video editing, gaming, browsing and Office, but more efficiently.

Are you talking in terms of software stack or hardware architecture? As if it's the latter then as per my conversation with Dsoup above I don't think this is a correct interpretation. Take a modern Zen2 APU based PC with an NVMe drive and Direct Storage (in it's fully implemented form) for example. At a hardware level, what you have there is essentially a PS5 with a smaller GPU and data decompression done on the GPU rather than a dedicated hardware unit (arguably a more elegant future looking design than using a fixed function hardware unit). The rest of the architecture is pretty much the same aside from some minor hardware amendments (e.g. cache scrubbers). Adding a dGPU in complicates things but it's necessary not because the PC is architected on legacy ideas but rather because it's a requirement for the modular nature of the PC (which is arguably it's key strength) and for going beyond the power and performance limits inherent to an APU.

I'm also not convinced on the software side tbh. The x86 instruction set may be old, but it's also central to the consoles as well as PC. And if you look at the OS, Graphics API, and now storage API, the PC is for the most part using the same code base as Xbox Series. So it's difficult to refer to one as being based on archaic legacy ideas and the other as being a next gen software architecture.

Deleted member 11852 · Mar 27, 2022

pjbliverpool said:
?? That's describing exactly the same thing I did. Where are you seeing a difference? I described both the lack of need to copy data to main memory (It avoids extra copies through a bounce buffer in the CPU’s memory) and the use of the NVMe drives own DMA engine to handle the data copy (enabling a direct memory access (DMA) engine near the NIC or storage to move data). I think you're getting hung up on the use of the term "direct path".

Maybe I'm reading Nvidia's own implementation differently, but what they are describing is something new. Not doing something different.

NVME-oF Target Offload is an implementation of the new NVME-oF standard Target (server) side in hardware. Starting from ConnectX-5 family cards, all regular IO requests can be processed by the HCA, with the HCA sending IO requests directly to a real NVMe PCI device, using peer-to-peer PCI communications. This means that excluding connection management and error flows, no CPU utilization will be observed during NVME-oF traffic.

I'm not "hung up" on Nvidia using 'direct path' but do believe when Nvidia use 'direct path' they mean a direct path. In your earlier post you said:

pjbliverpool said:
Nvidia are describing the transfer of data from SDD to GPU memory over the existing PCIe fabric and via the CPU's root complex

But all of Nvidia's text says the CPU is not involved at all. ¯\_(ツ)_/¯

pjbliverpool said:
The North Bridge is/was specifically a separate chip on the motherboard that contained functions such as the memory controller, PCIe graphics interface and iGPU.

Yes, that's the northbridge. The northbridge and south bridge were called 'bridges' because they were the CPUs bridges (bus) to everything else. These chipsets were traditionally separate both for thermal reasons and because it simply wasn't practical to have that many pins on the CPU socket. What you think of as part of the CPU is a diverse range of traditional distinct packages with varying features, which in many cases Intel still produce separate product briefs for. You don't think of them are separate products because in consumer space they're just part of Intel's vast array of packages for i7, and i9 chips.

None of that is really important because the way the processor part of the die connects to the external bus part of the die is not that different from when the separate processor chip was connected to the separate northbridge via the FSB. Intel still treat this packages are distinct from the CPU and the way they are laid out on the die visibly shows this. As do all logic diagrams of Intel CPUs. All of the I/O functions are still spread over two distinct I/O blocks - what on-die modern equivalents of the separate south bridge and northbridge chips.

It's no different to modern consoles having CPU and GPU on the same die. The Radeon GPU doesn't cease to be a Radeon GPU logic block because it shares a die with a Zen CPU block. They are not intrinsically connected other than the established buses between them on-die, nor is cache coherence any less complicated because they're on the same chip. It's still two distinct chunks of silicon with specific paths connecting them on a single die.

pjbliverpool · Mar 27, 2022

DSoup said:
Maybe I'm reading Nvidia's own implementation differently, but what they are describing is something new. Not doing something different.

NVME-oF Target Offload is an implementation of the new NVME-oF standard Target (server) side in hardware. Starting from ConnectX-5 family cards, all regular IO requests can be processed by the HCA, with the HCA sending IO requests directly to a real NVMe PCI device, using peer-to-peer PCI communications. This means that excluding connection management and error flows, no CPU utilization will be observed during NVME-oF traffic.

I'm not "hung up" on Nvidia using 'direct path' but do believe when Nvidia use 'direct path' they mean a direct path. In your earlier post you said:

But all of Nvidia's text says the CPU is not involved at all. ¯\_(ツ)_/¯

Okay let me clarify as you're right that my use of the term via the root complex was misleading. In a P2P DMA, the data does not have to go via the root complex if the devices involved in the memory transfer sit under the same switch, this may be the case for the data centre setups that GPUDirect Storage is designed for. However it's irrelevant for discussing consumer PC's where the the GPU and NVMe drive sit on entirely different root ports. Therefore although the P2P data transfer can still go direct from NVMe to GPU without going through system memory it must still go via the root complex on the CPU itself. The point here is that there is no need to "add a direct bus between the GPU and storage that would be bypassing the north/south-bridges and the operating system". P2P DMA does not require that, and is already supported in modern PC hardware. Its the software side that's lacking, but RTX-IO looks like it might be addressing that.

Yes, that's the northbridge. The northbridge and south bridge were called 'bridges' because they were the CPUs bridges (bus) to everything else. These chipsets were traditionally separate both for thermal reasons and because it simply wasn't practical to have that many pins on the CPU socket. What you think of as part of the CPU is a diverse range of traditional distinct packages with varying features, which in many cases Intel still produce separate product briefs for. You don't think of them are separate products because in consumer space they're just part of Intel's vast array of packages for i7, and i9 chips.

None of that is really important because the way the processor part of the die connects to the external bus part of the die is not that different from when the separate processor chip was connected to the separate northbridge via the FSB. Intel still treat this packages are distinct from the CPU and the way they are laid out on the die visibly shows this. As do all logic diagrams of Intel CPUs. All of the I/O functions are still spread over two distinct I/O blocks - what on-die modern equivalents of the separate south bridge and northbridge chips.

It's not different to modern consoles having CPU and GPU on the same die. They are not intrinsically connected other than the established buses between them on-die, nor is cache coherence any less complicated because they're on the same chip. It's still two distinct chunks of silicon with specific paths connecting them on a single die.

Again, the Northbridge was an entirely separate chip. Part of the reason it's functionality got integrated into the CPU is because the FSB was becoming a bottleneck. So no, your assertion that the connectivity between the different parts of the CPU is "not that different" from when the Northbridge was a separate chip is simply wrong. The bandwidth and latency are orders of magnitude apart.

But anyway, we're getting stuck in the weeds on this as the reason I raised it in the first place was because of your implication that data having to be "passed to the bus controller in the north-bridge, [and then] routed onwards" was a step the consoles somehow didn't have to contend with. And that's wrong. Whether it's console or modern PC, the trip is the same. The data is sent from NVMe drive over PCIe 4x to the CPU root complex and from there to main memory. The difference comes after that where on PC the data has to be forwarded again from system memory to GPU memory in systems utilising a dGPU. But as mentioned above, RTX-IO may remove the requirement for that additional forward by allowing the data to P2P DMA direct from NVMe to GPU memory - again, still via the root complex though.

PSman1700 · Mar 27, 2022

chris1515 said:
Amount of memory and memory BW contention is not a limit of shared memory but of console cost. This is possible from a technical point of view to have 32 GB of faster GDDR6 with 512 bits bus but the cost is too high for a console.

I wrote before its architectures are limited by cost-savings. If they had 32GB of faster GDDR then yeah, but they dont. Now they share a total of 16GB at 448gb/s for the whole system. I'd guess that anywhere between 4 and 6gb is going to be used for non-vram purposes and a good chunck of the BW for the CPU in most games. Atleast if games are going for something else than say tokyo ghostwire experiences. On pc you'd be looking at 700/800gb/s for the GPU alone, at up to 16GB or more. AMD's gpus have ultra fast Infinity Caches, which is another advantage.
Also in many cases CPU work actually is better with DDR4/5 ram for lower latency.

Shifty Geezer said:
That's actually a notable problem IMO. Basically you have to spend more money, more silicon, to achieve the same sort of thing. Gaming on PCs tended to be having a powerful computer for home computing, and then play games on it. But now processing has eclipsed the needs of the home computing environment, a lot of the push for bigger, faster, better PCs is just for gaming. At which point, trying to build on such archaic legacy ideas is...really wasteful!

The consoles represent a next-gen IBM architecture in reality. They are flexible computers that could, with the right software, do video editing, gaming, browsing and Office, but more efficiently. Without being tied to legacy software, the new microarchs could give a breath of fresh air to the computing space. With the work having been done, it'd be nice to be able to roll these designs back into the Windows space so it too can use super efficient file access, etc. What we have instead is an argument that PC can just approach all problems like the Industrial Revolution, with just bigger, faster, more, eating more electricity, and having only clumsy, brute force solutions.

Already covered by someone else, not seeing anything close to what reality is in your post. The difference between PC and consoles in hardware terms has never been as close as it is today, the notion that its all a waste, where everything is dependend on legacy, moar power draw, clumsy and brute force apply as much to consoles as to PC. Quite funny how you write this but looking at the PS5, its bigger then ever (biggest PS ever made), draws more power then anyone would have could have thought, and, they are still on the x86 and RDNA1.5 (and 2.0) architectures. They are having the APU (like laptops) design mainly due to cost vs efficiency and power.

This 'IBM architecture' applies to the consoles aswell, they are basically using PC architectures from AMD which are packaged together as cost effective as possible. And no i dont think pc gamers generally want the console design as its limited in many areas, APU designs are much more limited in GPU/CPU power (they are limited to low-end range in the consoles now), you cant upgrade them either which would mean swapping whole systems.

IMO, the true 'departure' from the old, power hungry, bad, legacy x86 would be what Apple/arm is doing, not the consoles or gaming laptops. Or so have i been told.

pjbliverpool said:
But anyway, we're getting stuck in the weeds on this as the reason I raised it in the first place was because of your implication that data having to be "passed to the bus controller in the north-bridge, [and then] routed onwards" was a step the consoles somehow didn't have to contend with. And that's wrong. Whether it's console or modern PC, the trip is the same. The data is sent from NVMe drive over PCIe 4x to the CPU root complex and from there to main memory. The difference comes after that where on PC the data has to be forwarded again from system memory to GPU memory in systems utilising a dGPU. But as mentioned above, RTX-IO may remove the requirement for that additional forward by allowing the data to P2P DMA direct from NVMe to GPU memory - again, still via the root complex though.

We will see, maybe they are right about, but if they are not and the differences are minimal or basically un-noticable, or even worse that the pc might actually load and stream things faster in like-for-like scenarios (hw/application) then its again a nice topic to read back on. Just like all the noise two years ago prior Direct Storage.

Allandor · Mar 28, 2022

PSman1700 said:
IMO, the true 'departure' from the old, power hungry, bad, legacy x86 would be what Apple/arm is doing, not the consoles or gaming laptops. Or so have i been told.

This is a bit OT, but Apple/ARM is not really a miracle-CPUs/APUs. It is just a transistor juggernaut. This CPU/APU is just so big (and yes I know it also has a GPU on bord) that the transistor count is enough for a high end AMD CPU and a high end RTX card and there would be still transistors left. Apple just produced a really, really big APU. This would be nowhere cost-effective but in premium-priced devices. Intel/AMD would never produce such big chips in quantities that are needed for the mass market with difference performance profiles. It would just be impossible to create cheap systems with such a big chip. And AMD/Intel CPUs must scale from really cheap CPUs to high performance server CPUs. So they just can't afford this apple-closed-system-premium-price design.
It is an especially bad design for the console space, except if consoles would be accepted at costs >>>>500€/$. But even than, the transistors could be better used for more "game-effective" logic.

Shifty Geezer · Mar 28, 2022

PSman1700 said:
Already covered by someone else, not seeing anything close to what reality is in your post.

Then why did you like Remji's post which I responded to? :???:

Remji said:

PCs will never have console level efficiencies in design.. we know that

Which you gave the thumbs up. You mean you disagree with that post and the ideas presented, to which I responded, that we can just use more and faster RAM on PC to workaround the console efficiencies?

I stated : "With the work having been done, it'd be nice to be able to roll these designs back into the Windows space so it too can use super efficient file access, etc."

Hence if that's happening, cool! Seems maybe it is and I've misunderstood the tech (not really paying that much close attention!). That doesn't take away from my point about Remji's point that using the Old Ways to bolster the PC space would be a Bad Idea. ¯\_(ツ)_/¯

PSman1700 · Mar 28, 2022

Shifty Geezer said:
Then why did you like Remji's post which I responded to?

Remji said:
Which you gave the thumbs up. You mean you disagree with that post and the ideas presented, to which I responded, that we can just use more and faster RAM on PC to workaround the console efficiencies?

I stated : "With the work having been done, it'd be nice to be able to roll these designs back into the Windows space so it too can use super efficient file access, etc."

Huh, wait let me read the posts again

I 'liked' his post because he both gave the advantages and disadvantages of each platforms. Ie the pc has higher latencies but can move larger amounts of data/faster etc. Also i have seen this happening before that, while i dont necessary agree with everything in a whole post but a some part of it (and stick a like).

Shifty Geezer said:
Hence if that's happening, cool! Seems maybe it is and I've misunderstood the tech (not really paying that much close attention!). That doesn't take away from my point about Remji's point that using the Old Ways to bolster the PC space would be a Bad Idea. ¯\_(ツ)_/¯

Yes, i think that the modern pc as-is isn't too bad of an architecture.... Its quite close to the consoles actually, closer than ever. You cant really do what the consoles do on PC from a hw perspective because in most casses pc's are upgradable and have different configurations possible (to whatever you want). However i do not think its problematic enough to cause problems in (future) games, some extra hops due to some extra paths are routed on pc, but that comes with the platform and probably isnt enough to see it as a large disadvantage i think. Modern hardware seems pretty optimized for what it is. But ofcourse consoles enjoy the efficiency/optimization side of things, their purely designed around gaming afterall without the constraints of different HW configs, upgradability etc etc..
Going to be intresting to see whenever the first game with DS/IO/PS5 etc in mind and see DF do some comparisons.

Deleted member 11852 · Mar 28, 2022

pjbliverpool said:
Again, the Northbridge was an entirely separate chip. Part of the reason it's functionality got integrated into the CPU is because the FSB was becoming a bottleneck. So no, your assertion that the connectivity between the different parts of the CPU is "not that different" from when the Northbridge was a separate chip is simply wrong.

CPU, GPU, audio, cache, RAM and blue logic all used to be separate chips and in many cases now they do not need to be. In many cases it's more efficient to co-locate them on a single, larger die with on-die interconnects than have pins to external connections.

pjbliverpool said:
The bandwidth and latency are orders of magnitude apart.

I cannot find a single benchmark or review that reports Sandybridge CPUs being "orders of magnitude apart" from Westmere CPUs with separate Northbridge, in terms of I/O. So I would really appreciate that claim being evidenced. The last northrbridge chipset that typically coupled with a Westmere CPU supported 16 PCI2.0 lanes and the first Sandybridge supported the exact same 16 PCI2.0 lanes.

Do you know why. Because as Anandtech explained, the "system agent" is just a "fancy name" for the northbridge. Intel just dropped that this right in there.

pjbliverpool · Mar 28, 2022

DSoup said:
I cannot find a single benchmark or review that reports Sandybridge CPUs being "orders of magnitude apart" from Westmere CPUs with separate Northbridge, in terms of I/O. So I would really appreciate that claim being evidenced. The last northrbridge chipset that typically coupled with a Westmere CPU supported 16 PCI2.0 lanes and the first Sandybridge supported the exact same 16 PCI2.0 lanes.

No-one said anything about real world performance of those functions being orders of magnitude apart for integrated vs on die. I said the bandwidth and latency between what was traditionally the CPU and northbridge functions is orders of magnitude better than what they were when these were separate chips. This should be obvious. We're talking about components on the same CPU die (or at least chip in Zens case) vs entirely separate chips talking over a bus that maxed out at about 12GB/s. Infinity fabric on the other hand is measured in hundreds of GB/s.

But as I said above, this is all academic. I only raised the point about the Northbridge being irrelevant because you raised it as some kind of obstacle in the PC architecture that had to be overcome. This is simply wrong. There is no difference in this respect between moderns PC's and consoles.

Deleted member 11852 · Mar 29, 2022

pjbliverpool said:
No-one said anything about real world performance of those functions being orders of magnitude apart for integrated vs on die. I said the bandwidth and latency between what was traditionally the CPU and northbridge functions is orders of magnitude better than what they were when these were separate chips. This should be obvious. We're talking about components on the same CPU die (or at least chip in Zens case) vs entirely separate chips talking over a bus that maxed out at about 12GB/s. Infinity fabric on the other hand is measured in hundreds of GB/s.

So your positions is that Sandybridge was obviously "orders of magnitudes" faster and better than Westmere in ways that neither manifested themselves in measurable ways? And you you think it's sensible to measure the transition point of the northbridge logic being incorporated into the CPU die with modern AMD technologies, like there was no evolutions of technology in the last eleven years? Rather than Westmere to Sandybridge - the transition point.

pjbliverpool said:
But as I said above, this is all academic. I only raised the point about the Northbridge being irrelevant because you raised it as some kind of obstacle in the PC architecture that had to be overcome. This is simply wrong. There is no difference in this respect between moderns PC's and consoles.

Nobody, myself included, said the northbridge - or the need for data to move over buses was an obstacle - just the path the data must take. I get that you like to pretend things that aren't on separate chips don't exist, despite all lithography analysis of Intel CPUs and Intel's own logic diagrams very clearly showing discrete logic functions still existing and there being clear logic path (the bus) twee the CPU and those blocks, but you.. ok. You can die on that hill.

All people are saying is that on consoles, the I/O controller decompresses data (without any need to read and write data to RAM) during the process of transferring from. The data undergoes two stages;. first to the I/O controller, where decompression happens in the on-die cache, then it's written direct to memory. It doesn't get any simpler, smarter or more efficient than that. Having to move data around a bunch of places, reading of compressed data from RAM and writing decompressed data to RAM - is a less efficient approach. But the only one that exists on PC right now.

Shifty Geezer · Mar 29, 2022

There's unnecessary arguing of semantics here. 'Northbridge' has been used to refer to the apparatus of communications whether on a discrete die or on the CPU die. DSoup's argument is that this apparatus provides a limiting factor in communication. I believe, though it needs clarification from him, that he thinks all and any communication between storage and GPU will need to be routed through this apparatus (hereafter referred to for convenience as the 'Northbridge') which slows potential communications down.

To determine if this is a limiting factor or not, it'd be nice to see if location of the NB on the CPU die or in a discrete package had any impact on latency, although that's not important. What is important is how the routing of data from storage to GPU via DirectIO and RTX IO passes through the PC system and, most importantly, what differences that has versus consoles including extra overheads.

RTX IO has this slide:

Is this accurate to how data actually flows? And if so, how is that routing comparable to consoles?

Having to move data around a bunch of places, reading of compressed data from RAM and writing decompressed data to RAM - is a less efficient approach. But the only one that exists on PC right now.

That's an overgeneralisation that's not very conducive to clear arguments ('a bunch of places' - which places in particular? How many at what sort of bottlenecks?). I also think you are wrong (for future RTX IO type dataflow), as AFAICS the data is decompressed on the fly by the GPU. The term 'GPU memory' has been confusing, but I think they are referring as much to on-GPU caches as VRAM. So the data flow, as I understand it, is from IO controller to GPU directly, decompressed into VRAM.

The only slower step is then a copy to CPU RAM for CPU data. If we assume the vast majority of streamed data is graphics data, that shouldn't be a problem, though it will add some overhead for initial world load and set-up.

Of course, the current system is clearly inefficient, reading from IO to CPU to RAM to CPU to decompress to RAM to copy to VRAM, which is where increasing size and speed of everything to compensate as Remji's suggestion isn't a great solution. But Direct Storage and the GPU IHVs seem to be managing to create a more efficient data-flow.

davis.anthony · Mar 29, 2022

Shifty Geezer said:
RTX IO has this slide:

Is this accurate to how data actually flows?

Looking at it no, it's not accurate.

PCIEX lanes are contained within the CPU so any storage device such as an NVME drive would need to connected to the CPU.

So I have no idea why Nvidia are showing the PCIE as a separate block on it's own.

And as far as I know, storage drives in gaming PC's don't connect to NIC's.

So imo that diagram is very wrong.

see colon · Mar 29, 2022

davis.anthony said:
Looking at it no, it's not accurate.

PCIEX lanes are contained within the CPU so any storage device such as an NVME drive would need to connected to the CPU.

So I have no idea why Nvidia are showing the PCIE as a separate block on it's own.

And as far as I know, storage drives in gaming PC's don't connect to NIC's.

So imo that diagram is very wrong.

I think that slide has PCIe as a separate block to indicate that it isn't using CPU cycles, regardless of where the PCIe controller resides. PCIe devices can technically DMA other PCIe devices without going through the CPU or system memory, so I think that's what they are trying to show. As far as the NIC? I don't know, unless they are trying to show that it works with NVMEoF or something. Or perhaps there is another acronym that nVidia is trying to go for that should have been defined on the slide.

PSman1700 · Mar 29, 2022

DSoup have a look where they explain the data paths.

Silent_Buddha · Mar 29, 2022

see colon said:
I think that slide has PCIe as a separate block to indicate that it isn't using CPU cycles, regardless of where the PCIe controller resides. PCIe devices can technically DMA other PCIe devices without going through the CPU or system memory, so I think that's what they are trying to show. As far as the NIC? I don't know, unless they are trying to show that it works with NVMEoF or something. Or perhaps there is another acronym that nVidia is trying to go for that should have been defined on the slide.

Yes, this is basically how GRAID's SupremeRAID works (Solution | GRAID Technology | The Future of NVMe for SSDs ). Linus sensationalizes it a bit in his video...

The "RAID" card being used is basically just an off the shelf NV GPU (you can amusingly even run some games with the "RAID" card rendering the game) based video card manufactured by PNY to which a license for SupremeRAID has been software locked.

It's a RAID card that doesn't require the NVME drives to be connected to the RAID card, instead it offloads the processing from the CPU for anything RAID related to achieve significantly higher RAID transfer rates with significantly lower CPU usage when paired with NVME. CPU based software raid for high speed NVME drives was already faster than Hardware RAID, but uses significant CPU resources.

Basically, while the data touches the CPU as the storage PCIE lanes go through the CPU, the actual CPU cores don't really touch it in any way. Higher levels of RAID do still require some CPU cycles, but significantly lower than without the GPU and RAID software.

So, you can likely see what Microsoft are trying to do with GPU based decompression when it comes, that could potentially almost completely remove the CPU cores from the equation. Data "bounces" off the CPU (via the integrated northbridge) to the GPU for decompression then GPU resources remain on the GPU while CPU resources are redirected to system memory.

That's likely also what NV are doing with their RTX-IO demonstration.

Regards,
SB

PSman1700 · Mar 29, 2022

Im amazed by what MS can do, but then again they have decades of experience in hardware and software engineering. They are attacking every front with results now, dedicated consoles (xss/xsx), PC gaming, live services and finally the most impressive, GamePass. On the console front they really have improved alot since 2013 launch, they have the hardware advantage, software is up to sniff and they got games, which was a problem last time. The PC side has gotten similar-to-console support for nvme/IO streaming (and with the help if NV/AMD its amazing, just like the PS5 in efficiency as shown in LTT video), and everythings on both MS platforms. GamePass was under critics when it was announced but that took off like nothing else now.
Windows 11 even though some dont like it (mostly interface related), as a gaming platform its an improvement over W10 with the design around new API's like direct storage and the integration of android apps.

Shifty Geezer · Mar 29, 2022

davis.anthony said:
Looking at it no, it's not accurate.

PCIEX lanes are contained within the CPU so any storage device such as an NVME drive would need to connected to the CPU.

So I have no idea why Nvidia are showing the PCIE as a separate block on it's own.

Do you mean through the CPU, requiring CPU intervention, or just the CPU die? The term 'CPU' is being used to refer to both the processing unit silicon and the die which can contain other, independent functional units. I understand this diagram to be abstracted to show the CPU processing isn't involved at all in the GPU data access; only the IO interface that resides on the same CPU package.

davis.anthony · Mar 29, 2022

Shifty Geezer said:
Do you mean through the CPU, requiring CPU intervention, or just the CPU die? The term 'CPU' is being used to refer to both the processing unit silicon and the die which can contain other, independent functional units. I understand this diagram to be abstracted to show the CPU processing isn't involved at all in the GPU data access; only the IO interface that resides on the same CPU package.

Through the CPU die, although there still should be some CPU intervention no?

Decompression is only one part and the others such as 'file check in' and other tasks Mark Cerny talked about will still be handled CPU side?

RTX I/O isn't going to completely remove the CPU from having to do work.

What decompresses the files that are not graphics related? Audio for example, is that still done CPU side as normal or does it go GPU for decompression and then CPU > RAM?

The benefits of PS5's set-up is the decompression hardware and I/O complex decompress all game related data in the most straight forward way possible.

Next-Generation NVMe SSD and I/O Technology [PC, PS5, XBSX|S]

Shifty Geezer

uber-Troll!

pjbliverpool

B3D Scallywag

pjbliverpool

B3D Scallywag

Deleted member 11852

Guest

pjbliverpool

B3D Scallywag

PSman1700

Allandor

Shifty Geezer

uber-Troll!

PSman1700

Deleted member 11852

Guest

pjbliverpool

B3D Scallywag

Deleted member 11852

Guest

Shifty Geezer

uber-Troll!

davis.anthony

see colon

All Ham & No Potatos

PSman1700

Silent_Buddha

PSman1700

Shifty Geezer

uber-Troll!

davis.anthony

Similar threads