Blazing Fast NVMEs and Direct Storage API for PCs *spawn*

Pretty sure it can, but it doesn't need to, as shown by the Radeon SSG with its directly connected SSDs.
In that slide, system DRAM is just another memory pool it can connect to, alongside NVRAM, network storage, etc.
I think the difference is that data has to be put into the HBC first for the GPU to use it, whereas with NVIDIA the requested data is pushed directly into the GPU's memory (avoiding system memory and the CPU).
 
I think the difference is that data has to be put into the HBC first for the GPU to use it, whereas with NVIDIA the requested data is pushed directly into the GPU's memory (avoiding system memory and the CPU).
That slide can be a little misleading: the HBC in the slide is the GPU's memory (the HBM2). According to AMD, HBCC allows Vega to treat other memories as video memory and the HBM2/HBC as a last-level cache, pulling data directly from whatever other memories are attached straight into its HBM2, just like NVIDIA plans to pull data straight into the GPU's memory with RTX IO.
 
"GPU based decompression capability" is nothing but a program running on shaders of the GPU, what's to advertise there?

If the RTX IO approach of bypassing the CPU and system memory to do decompression on the GPU isn't a standard feature of DirectStorage, then it's not going to simply happen on its own when the end user happens to be using an AMD GPU. Developers would need to specifically target the functionality through a dedicated API, just like RTX IO.

AMD have announced nothing about this.
 
That slide can be a little misleading: the HBC in the slide is the GPU's memory (the HBM2). According to AMD, HBCC allows Vega to treat other memories as video memory and the HBM2/HBC as a last-level cache, pulling data directly from whatever other memories are attached straight into its HBM2, just like NVIDIA plans to pull data straight into the GPU's memory with RTX IO.

Treating main memory as video memory is the main feature of AGP, which is more than two decades old.
What RTX IO (and similar technologies) does has, again, always been supported by PCIe; it's just not commonly used in desktop environments. The potential problem on the desktop is that desktop CPUs normally don't have that many PCIe lanes, so peripherals have to connect to a south bridge chip, which in turn connects to the CPU through a proprietary link. Therefore, if an NVMe SSD is connected to the south bridge instead of directly to the CPU, there could be routing problems. That doesn't necessarily mean it can't be done in such a system, but it might have to be specifically configured, or worse, only be available on some south bridge chips.
 
Treating main memory as video memory is the main feature of AGP, which is more than two decades old.
What RTX IO (and similar technologies) does has, again, always been supported by PCIe; it's just not commonly used in desktop environments. The potential problem on the desktop is that desktop CPUs normally don't have that many PCIe lanes, so peripherals have to connect to a south bridge chip, which in turn connects to the CPU through a proprietary link. Therefore, if an NVMe SSD is connected to the south bridge instead of directly to the CPU, there could be routing problems. That doesn't necessarily mean it can't be done in such a system, but it might have to be specifically configured, or worse, only be available on some south bridge chips.

Treating main memory as video memory is old, yes, but for HBCC it doesn't need to be main memory; it can be anything from main memory to a storage drive behind a network. And even for NVIDIA alone, RTX IO isn't completely new; they already have GPUDirect Storage on the professional side.
 
Treating main memory as video memory is the main feature of AGP, which is more than two decades old.
What RTX IO (and similar technologies) does has, again, always been supported by PCIe; it's just not commonly used in desktop environments. The potential problem on the desktop is that desktop CPUs normally don't have that many PCIe lanes, so peripherals have to connect to a south bridge chip, which in turn connects to the CPU through a proprietary link. Therefore, if an NVMe SSD is connected to the south bridge instead of directly to the CPU, there could be routing problems. That doesn't necessarily mean it can't be done in such a system, but it might have to be specifically configured, or worse, only be available on some south bridge chips.
I don't think that will be an issue. Otherwise, too many systems wouldn't be able to take advantage of DirectStorage, which would badly hurt its adoption. Same with requiring a PCIe switch, which surely won't happen.

Most systems have their NVMe PCIe lanes routed through the PCH rather than connected directly to the CPU, especially on the Intel side. Frankly, I don't see how this could become an issue anyway: DMA should work in any configuration, and the goal of reducing CPU overhead should be achieved regardless of whether the NVMe drive is connected to the CPU or the PCH.
 
The Pro SSG may have an SSD connected directly to the GPU, but I've seen no claims of real-time compute-shader-based decompression from it.
Because shader-based texture compression tools did not exist at the time of the Vega announcement?

BCPack works as a two-stage encoding process: a proprietary lossy block texture compression that can be transformed to standard BCn, plus an additional lossless LZ (Deflate) compression step to squeeze out some extra space. I'm still not convinced that LZ decompression could be handled efficiently with shaders, but transcoding between different lossless block texture compression formats should be quite possible in real time.
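To make the two-stage idea concrete, here's a minimal Python sketch, using Deflate for the lossless LZ layer as described above; the block-transcode step is just a pass-through placeholder, since the actual BCPack block format hasn't been published:

```python
import zlib

BLOCK_SIZE = 16  # bytes per 4x4 texel block in BC7


def transcode_block(block: bytes) -> bytes:
    """Placeholder for the proprietary-to-BCn transform.

    The real BCPack block layout is not public, so this just passes
    16-byte blocks through unchanged to show the shape of the pipeline."""
    return block


def decode_two_stage(payload: bytes) -> bytes:
    # Undo the lossless LZ layer first (Deflate stands in for it here).
    intermediate = zlib.decompress(payload)
    # Then rewrite each block into a standard BCn block the GPU can sample.
    out = bytearray()
    for off in range(0, len(intermediate), BLOCK_SIZE):
        out += transcode_block(intermediate[off:off + BLOCK_SIZE])
    return bytes(out)


if __name__ == "__main__":
    fake_blocks = bytes(range(256)) * 4            # stand-in for encoded blocks
    packed = zlib.compress(fake_blocks, level=9)   # stand-in for the LZ stage
    assert decode_two_stage(packed) == fake_blocks
```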

they had to physically connect an SSD to the GPU to make it work.
It makes no difference - the NVMe SSD in the Radeon Pro SSG was connected to the GPU's PCIe lanes through a PCIe switch to support peer-to-peer (P2P) DMA between the SSD and the GPU, but AMD Zen processors have built-in PCIe switches on the CPU root complex ports, as we have previously discovered.

The SSD would still be visible to the OS and the driver stack; however, to support P2P DMA the video card driver would need to handle block I/O from NVMe SSDs (and RDMA network cards), and AFAIK this is not currently possible with the WDDM 2.x and StorPort/StorNVMe driver models.


we don't yet know the system requirements for DirectStorage.
We do know that an NVMe SSD is required.

I'd guess they could also recommend PCIe Resizable BAR with a BAR size of 4 GB+ (aka AMD Smart Access Memory), but requiring it would severely limit the supported base to Zen 3 systems.

AMD haven't announced any GPU based decompression capability
They do support BCPack compression on the Xbox Series X, which is a companion to DirectStorage in the 'Xbox Velocity Architecture'. Microsoft has not yet announced BCPack tools for Windows.


if a NVMe SSD is connected to the south bridge instead of to the CPU directly, there could be routing problems
Only if they required peer-to-peer (P2P) DMA, i.e. direct transfers from the NVMe SSD to GPU video memory, but that would limit the installed base to recent Zen and Xeon processors.

Regular bus-master DMA works just fine with chipset PCIe ports.
 
I wonder what role DDR RAM will play when DirectStorage is finally in use. With DS, assets and textures can be streamed directly to GPU memory, bypassing DRAM and the CPU. That is a drastic difference compared to how it is today, where data goes from storage to the CPU/DRAM and only then to VRAM.

I think one thing is for sure: the PC will continue to use DRAM for the CPU, while on consoles CPU and GPU memory are shared, as always. I don't think that will change.

For graphics, however, what will happen when VRAM overflows? Currently, DRAM is used as a cache for graphics data, but with DS that role would fall to the SSD, right? Which is much slower than DRAM. Then again, the same applies to the consoles, so I assume decompression and Sampler Feedback will help tremendously here, preventing stuttering and pop-in when assets and textures are loaded from the SSD into GPU memory.

In that case, would it also be possible to use DRAM as a buffer for VRAM again for certain assets and textures, like a super-charged SSD? So the question is: would the PC be able to switch seamlessly between the current RAM/VRAM management and the DirectStorage approach (bypassing DRAM)?
 
I do expect that AMD will support GPU decompression, but this isn't it.
This isn't done on the GPU using shaders etc.
It's a dedicated custom block.

In all the history of GPUs, or any larger chip really, there are very few instances of fixed-function decompression functionality being provided. Basically all but one were added after decades of planning or use of the same coding scheme. The lossy block codecs are naturally consortium-driven, for a variety of reasons, among them patent issues as well as longevity concerns. That's why these schemes are normally driven by standards bodies like ISO or Khronos; S3TC, ETC and ASTC are good examples of this, as are video codecs using arithmetic coders and the like. This is a lesson from the arithmetic coder wars of the '80s and '90s.
Other lossless variants are almost always used only in embedded systems. The GPU in a PS4 is embedded in that context: it can't change, it's always there; and likewise software for the PS4 is in a sense embedded software, because it only runs on that machine. In this context you can do whatever you as a product maker want, because you own it and it'll never change. This is why Kraken is possible (that easily), and on other chips LZ77 or Huffman or whatever you fancy.

Because the content creator is responsible for the encoding of the data, it's enormously difficult to introduce changes to a system wired in a fixed way. While it's possible to transcode from one coding to the next, I've never really heard of transcoding hardware (zip to rar? Huffman to arithmetic?). The general problem is that encoding is at times a very hard problem, and a general-purpose CPU is perfectly suited to doing it over a couple of hours, often with large memory use. Not the stuff for a hardware block.
A graphics card is replaceable, and people tend to replace it fairly often, so whatever scheme you come up with should *never* change. At all. Because you can't realistically transcode later, and you have to question the usefulness of a proprietary scheme when, on most replaced hardware, it falls back to the general-purpose CPU.

GPU decompression schemes have been around since GPGPU became a term; unsurprisingly, the "G" for general shows here in that actual transcoders have run on GPUs (JPEG-XR to BTC in Rage). Huffman and arithmetic coders have been trivially runnable at rates in excess of GB/s on a compute shader for a long while now. It's an easy topic to investigate.

Now, I see basically no reason why a developer would go through the hassle of compressing data offline in a proprietary scheme which might disappear with the next card, or become utterly bad relative to new inventions. Nobody wants to be locked in on a non-embedded platform. Nvidia's developers know this very well; they're not amateurs and have intimate knowledge of the tradeoffs and the history of compression schemes in chips (they are part of it). Even something as innocent as a (hypothetical, or accidentally guessed :)) NVLink peer-to-peer compression scheme has major ramifications for product design with regard to compatibility and maintenance.

Any type of compression scheme on something like a PC would either go through a standards body a decade before it's in a chip, or it would be in software. Whatever compute-based solution is shipped with a game is embedded in that product, so if you don't update your product you are free of any side effects. If you do, the same considerations apply, but at least it's easy to transcode, because it's all software.

The GPU decompressor Nvidia mentions might be an original invention (and might even only be practical with tensor cores or whatnot), but it is unlikely to be anything done with dedicated hardware units, and it is entirely embedded in the products using it.
 
I wonder what role DDR RAM will play when DirectStorage is finally in use. With DS, assets and textures can be streamed directly to GPU memory, bypassing DRAM and the CPU. That is a drastic difference compared to how it is today, where data goes from storage to the CPU/DRAM and only then to VRAM.

I think one thing is for sure: the PC will continue to use DRAM for the CPU, while on consoles CPU and GPU memory are shared, as always. I don't think that will change.

For graphics, however, what will happen when VRAM overflows? Currently, DRAM is used as a cache for graphics data, but with DS that role would fall to the SSD, right? Which is much slower than DRAM. Then again, the same applies to the consoles, so I assume decompression and Sampler Feedback will help tremendously here, preventing stuttering and pop-in when assets and textures are loaded from the SSD into GPU memory.

In that case, would it also be possible to use DRAM as a buffer for VRAM again for certain assets and textures, like a super-charged SSD? So the question is: would the PC be able to switch seamlessly between the current RAM/VRAM management and the DirectStorage approach (bypassing DRAM)?

It's certainly possible to load compressed texture data from DRAM, just like loading it from the SSD. Actually, it's less likely to have compatibility problems, as the GPU can already load data directly from main memory on all current systems.
It shouldn't be hard for DirectStorage to support the existing caching mechanisms in the OS, so if main memory has enough space, the same data won't have to be read twice from the SSD.
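As a toy illustration of that caching idea (the names here are purely illustrative, not any real DirectStorage API), here's a read-through cache in Python: the first request for a byte range goes to the file, and repeat requests are served from memory, which is roughly the role the OS page cache could keep playing:

```python
import os
import tempfile

# (path, offset, length) -> bytes; stands in for the OS-level page cache.
_cache = {}


def read_asset(path, offset, length):
    key = (path, offset, length)
    if key not in _cache:
        with open(path, "rb") as f:      # cold read: goes to the "SSD"
            f.seek(offset)
            _cache[key] = f.read(length)
    return _cache[key]                   # warm read: served from "DRAM"


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(4096))
        name = tmp.name
    first = read_asset(name, 1024, 512)   # reads from the file
    second = read_asset(name, 1024, 512)  # comes from the in-memory cache
    assert first == second
    os.unlink(name)
```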
 
In all the history of GPUs, or any larger chip really, there are very few instances of fixed-function decompression functionality being provided. Basically all but one were added after decades of planning or use of the same coding scheme. The lossy block codecs are naturally consortium-driven, for a variety of reasons, among them patent issues as well as longevity concerns. That's why these schemes are normally driven by standards bodies like ISO or Khronos; S3TC, ETC and ASTC are good examples of this, as are video codecs using arithmetic coders and the like. This is a lesson from the arithmetic coder wars of the '80s and '90s.
Other lossless variants are almost always used only in embedded systems. The GPU in a PS4 is embedded in that context: it can't change, it's always there; and likewise software for the PS4 is in a sense embedded software, because it only runs on that machine. In this context you can do whatever you as a product maker want, because you own it and it'll never change. This is why Kraken is possible (that easily), and on other chips LZ77 or Huffman or whatever you fancy.

Because the content creator is responsible for the encoding of the data, it's enormously difficult to introduce changes to a system wired in a fixed way. While it's possible to transcode from one coding to the next, I've never really heard of transcoding hardware (zip to rar? Huffman to arithmetic?). The general problem is that encoding is at times a very hard problem, and a general-purpose CPU is perfectly suited to doing it over a couple of hours, often with large memory use. Not the stuff for a hardware block.
A graphics card is replaceable, and people tend to replace it fairly often, so whatever scheme you come up with should *never* change. At all. Because you can't realistically transcode later, and you have to question the usefulness of a proprietary scheme when, on most replaced hardware, it falls back to the general-purpose CPU.

GPU decompression schemes have been around since GPGPU became a term; unsurprisingly, the "G" for general shows here in that actual transcoders have run on GPUs (JPEG-XR to BTC in Rage). Huffman and arithmetic coders have been trivially runnable at rates in excess of GB/s on a compute shader for a long while now. It's an easy topic to investigate.

Now, I see basically no reason why a developer would go through the hassle of compressing data offline in a proprietary scheme which might disappear with the next card, or become utterly bad relative to new inventions. Nobody wants to be locked in on a non-embedded platform. Nvidia's developers know this very well; they're not amateurs and have intimate knowledge of the tradeoffs and the history of compression schemes in chips (they are part of it). Even something as innocent as a (hypothetical, or accidentally guessed :)) NVLink peer-to-peer compression scheme has major ramifications for product design with regard to compatibility and maintenance.

Any type of compression scheme on something like a PC would either go through a standards body a decade before it's in a chip, or it would be in software. Whatever compute-based solution is shipped with a game is embedded in that product, so if you don't update your product you are free of any side effects. If you do, the same considerations apply, but at least it's easy to transcode, because it's all software.

The GPU decompressor Nvidia mentions might be an original invention (and might even only be practical with tensor cores or whatnot), but it is unlikely to be anything done with dedicated hardware units, and it is entirely embedded in the products using it.
Did you quote me by mistake?

Just rereading my post: did you think I was saying that AMD GPUs had fixed-function hardware? If so, I meant that I expect them to support GPU decompression via compute using CUs.
I was specifically pointing out that the XSX|S actually does have hardware decompression blocks.
 
Cheaper, larger SSDs reviewed: https://www.anandtech.com/show/16136/qlc-8tb-ssd-review-samsung-870-qvo-sabrent-rocket-q

That Samsung SATA SSD feels like a really decent mass storage/legacy games drive: 4 TB for about $400, or double the price for double the capacity. I have high hopes that my next PC build, in 1.5 years or so, will need no more spinning disks. Once Zen 4 is widely available I will push the buy button. My old build from 2013 is starting to feel dated. I'll probably buy something crazy, as it seems the lifetime of a good PC is very, very long.

Is there any news on Intel PCIe Gen 4 Optane drives? There was only some tweet earlier this year that they exist, but no concrete public information. Something like a 500 GB Optane drive for boot + apps, a fast SSD for next-gen games, and a SATA SSD for mass storage feels like the ultimate, yet somewhat reasonable, build.
 
Did you quote me by mistake?

Just rereading my post: did you think I was saying that AMD GPUs had fixed-function hardware? If so, I meant that I expect them to support GPU decompression via compute using CUs.
I was specifically pointing out that the XSX|S actually does have hardware decompression blocks.

My bad, take it as reinforcement. :)
 
This isn't done on the GPU using shaders etc.
It's a dedicated custom block.
I expect them to support GPU decompression via compute using CUs.

AFAIK the custom block in the Xbox storage controller only handles the LZ-family compression part, and the lossy texture compression part is handled by shaders, which decode it to standard BC1-BC7 (S3TC/DXT) formats.

Either way, the RDNA2 ISA specification does not mention any new resource compression formats beyond BC1-BC7.


With DS, assets and textures can be streamed directly to GPU memory, bypassing DRAM and the CPU. For graphics, however, what will happen when VRAM overflows?
would the PC be able to switch seamlessly between the current RAM/VRAM management and the DirectStorage approach (bypassing DRAM)?
DirectStorage does not bypass anything on the Xbox; there is no separate DRAM or VRAM in the first place, it's unified memory shared between the CPU and GPU. Details of the Windows 10 implementation are still scarce, but I don't think Microsoft has announced anything about 'bypassing system RAM' either.
Beta testing will probably begin in 2021 with the recent Cobalt Insider Preview branch (builds 212xx); so far there have been changes to the NVMe StorPort driver which add support for size/granularity hints in NVMe 1.3 and ZNS in NVMe 2.0. More details will come with the release of the Cobalt Insider Preview SDK/WDK 212xx.

My bet is WDDM will be updated to plug into the I/O Manager / Cache Manager subsystem and to handle disk I/O requests directly from the StorNVMe driver. The DirectStorage part would provide new DDIs and APIs optimized for NVMe I/O patterns, reducing disk access time and improving transfer rate, while the BCPack LZ-family part would be handled by the existing LZX 'CompactOS' compression filter driver (which could be plugged into a dedicated hardware implementation in a WDDM driver if required).


it's enormously difficult to introduce changes to a system wired in a fixed way.
a general-purpose CPU is perfectly suited to doing it over a couple of hours, often with large memory use.
I see basically no reason why a developer would go through the hassle of compressing data offline in a proprietary scheme

True. LZ77 derivatives (ZIP/LZMA/LZX) define basic dictionary-based stream formats, where the encoder can be tuned for lower decompression complexity by using a larger dictionary and massively parallel processing to find more matches. Thus a proprietary implementation can extract better efficiency from the same basic stream format.

This is the way the Oodle Leviathan/Kraken tools work, and Sony basically licensed Oodle Kraken for its proprietary LZ77 encoder and decoder implementations, which can run very fast on embedded CPU cores (see this forum post by the author of the ooz project).
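To illustrate why the decode side stays cheap no matter how much work the encoder puts in, here's a toy LZ77 decoder in Python; the token format is made up for the example, and real formats like Deflate or Kraken pack this far more tightly:

```python
def lz77_decode(tokens):
    """Decode a toy LZ77 token stream.

    Tokens are either ('lit', byte) or ('copy', distance, length). The
    decoder only copies bytes, so its cost stays the same regardless of how
    much effort (big dictionary, parallel match search) the encoder spent."""
    out = bytearray()
    for tok in tokens:
        if tok[0] == "lit":
            out.append(tok[1])
        else:
            _, dist, length = tok
            for _ in range(length):
                out.append(out[-dist])  # byte-by-byte so overlapping copies work
    return bytes(out)


# "abcabcabcd": three literals, then a 6-byte back-reference with distance 3.
tokens = [("lit", ord("a")), ("lit", ord("b")), ("lit", ord("c")),
          ("copy", 3, 6), ("lit", ord("d"))]
assert lz77_decode(tokens) == b"abcabcabcd"
```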

It's certainly possible to load compressed texture data from DRAM, just like loading it from the SSD. Actually, it's less likely to have compatibility problems, as the GPU can already load data directly from main memory on all current systems.
It shouldn't be hard for DirectStorage to support the existing caching mechanisms in the OS
Yes, typical PCs have two to four times as much system RAM as video memory, and a dual-channel DDR4 configuration is roughly an order of magnitude faster than current NVMe SSDs. So system RAM will still be used to hold graphics resources and swap them into video memory as necessary.

Furthermore, PCIe 6.0 x16 provides about 126 GB/s, four times the bandwidth of PCIe 4.0; this would roughly match the bandwidth of dual-channel DDR5.
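Rough, back-of-the-envelope peak figures behind those comparisons (ignoring protocol and controller overheads):

```python
# Peak bandwidth in GB/s; real numbers vary with memory speed, lane count and
# protocol/controller overhead, so treat these as rough orders of magnitude.
ddr4_dual_channel = 2 * 8 * 3.2          # 2 channels x 8 bytes x 3200 MT/s ~ 51.2
nvme_pcie4_x4 = 4 * 16 / 8 * 128 / 130   # 4 lanes x 16 GT/s, 128b/130b     ~ 7.9
pcie4_x16 = 4 * nvme_pcie4_x4            # 16 lanes                         ~ 31.5
pcie6_x16 = 4 * pcie4_x16                # two more signalling doublings    ~ 126

print(f"DDR4-3200 dual channel ~ {ddr4_dual_channel:.1f} GB/s")
print(f"NVMe PCIe 4.0 x4       ~ {nvme_pcie4_x4:.1f} GB/s")
print(f"PCIe 6.0 x16           ~ {pcie6_x16:.0f} GB/s")
```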
 
AFAIK the custom block in the Xbox storage controller only handles the LZ-family compression part, and the lossy texture compression part is handled by shaders, which decode it to standard BC1-BC7 (S3TC/DXT) formats.

Either way, the RDNA2 ISA specification does not mention any new resource compression formats beyond BC1-BC7.
That's because you're looking at RDNA2.
It's custom; see the last sentence: 2 general + custom texture decompression.
[attached slide image]
 
Still not enough details; this could mean a variant of the LZ family.

No, I simply said that BCPack has to decode to standard BCn, since that's what RDNA2 supports.
I was going to edit my post, but you'd already replied.
Yes, it will decode into a supported format.
I would need to rewatch the video or find a transcript to see what is explicitly said.
But LZ is the general decompression; the other specifically states texture, so it's obviously tailored for texture decompression, regardless of format.
And the discussion was about hardware texture decompression.
 
The other specifically states texture, so it's obviously tailored for texture decompression, regardless of format.
Only if they were referring to lossy block texture compression like the BCn (DXT/S3TC) formats, which they didn't.
OTOH, lossless 'texture decompression' would be a lossless transform step taken prior to general (dictionary-based) LZ-family compression, like the above-mentioned BC7Prep step; this would align with the other hash-based algorithms handled by this processor.
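As a purely illustrative sketch of what such a pre-LZ lossless transform can look like (this shows the general idea, not the actual BC7Prep layout):

```python
import zlib

BLOCK = 16  # bytes per BC7 block


def split_byte_planes(data: bytes) -> bytes:
    """Illustrative lossless transform: gather byte i of every block together.

    Fields that change slowly across neighbouring blocks end up in long runs,
    which the following LZ stage can often compress better."""
    n = len(data) // BLOCK
    return bytes(data[b * BLOCK + i] for i in range(BLOCK) for b in range(n))


if __name__ == "__main__":
    # Synthetic blocks: one byte varies per block, the other 15 are constant.
    blocks = b"".join(bytes([i]) + bytes(BLOCK - 1) for i in range(256))
    print("plain      :", len(zlib.compress(blocks, 9)))
    print("transformed:", len(zlib.compress(split_byte_planes(blocks), 9)))
```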
 