Blazing Fast NVMEs and Direct Storage API for PCs *spawn*

Do you know what disk fragmentation is? GPUs have the same problem with their memory. This is one of the reasons why Vulkan/DX12 programming is difficult, since developers are now fully exposed to the memory fragmentation problem.
How is it all connected? The original assertion was that the lack of dedicated decompression hardware on GPUs does not make them inferior; just different, and perhaps more specialised for their own use case.

How does all of this tie into it being inferior to fixed-function hardware?
 
Tensor cores consist of tiny ALUs designed for handling small data types (FP16 at most), focused on ML inference. Why would these be good for decompressing large datasets? Is there any research pointing to successful use of tensor cores for data decompression?

Same goes for INT8 and INT4 processing. Quad-rate INT8 just means it performs four INT8 operations in parallel, not that a single INT8 operation runs 4x faster.
I have yet to see how data decompression can become embarrassingly parallel the way graphics rendering, NN training or NN inference are.
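To make the "parallel, not faster" distinction concrete, here is a minimal CUDA sketch (the kernel name and test values are made up; __dp4a needs a Pascal-or-newer GPU, sm_61+). One instruction issues four INT8 multiply-adds packed into 32-bit registers, so aggregate throughput quadruples while each individual INT8 operation gets no faster:

```cuda
#include <cstdio>

// One __dp4a issues four INT8 multiply-adds packed into 32-bit registers:
// aggregate throughput goes up 4x, but each individual INT8 operation is
// no faster than before. Requires sm_61 or newer.
__global__ void packed_int8_dot(const int* a, const int* b, int* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Each 32-bit word carries four signed 8-bit lanes;
        // the four products are summed and accumulated in one instruction.
        out[i] = __dp4a(a[i], b[i], 0);
    }
}

int main()
{
    const int n = 4;
    int ha[n], hb[n], hout[n];
    for (int i = 0; i < n; ++i) {
        ha[i] = 0x01010101 * (i + 1);  // four packed 8-bit values per word
        hb[i] = 0x02020202;
    }
    int *da, *db, *dout;
    cudaMalloc(&da, n * sizeof(int));
    cudaMalloc(&db, n * sizeof(int));
    cudaMalloc(&dout, n * sizeof(int));
    cudaMemcpy(da, ha, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, n * sizeof(int), cudaMemcpyHostToDevice);
    packed_int8_dot<<<1, n>>>(da, db, dout, n);
    cudaMemcpy(hout, dout, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i) printf("%d\n", hout[i]);  // 8, 16, 24, 32
    cudaFree(da); cudaFree(db); cudaFree(dout);
    return 0;
}
```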

For example, the Switch got a significant boost in loading times when Nintendo allowed the CPU cores to clock up to 1.9GHz during loading screens. That's a gain coming from higher single-threaded performance. If the TX1 GPU's 256 shader cores with 2xFP16 throughput were any good for data decompression, that route would have been used instead, as the shader cores are AFAIK mostly idle during loading screens.

It's not an area that I actively track, but there is a lot of research out there (including by Rad Game Tools, the creators of Kraken) WRT GPU-accelerated decompression. Just do a web search on GPGPU decompression and you'll get quite a few hits that you can peruse on your own.

The reason that PS4 and XBSX didn't use GPGPU accelerated decompression is that there are very strict restrictions placed on a console that don't exist on a PC. A console can also force universal inclusion of an esoteric hardware feature (like a hardware decompression block specific to graphics rendering).

As to whether the tensor cores would be useful for it, no idea. It's just a hypothesis as to why it was limited to the 2 GPUs which had tensor cores. But they are at heart just FP16 number crunching machines. It's likely that you couldn't "only" use the tensor cores, but it's also possible that the tensor cores could be used to assist in GPGPU decompression.

Dual issue FP32 just means more threads working in parallel, not higher single-threaded performance on FP32. The previous point still stands.

I didn't say anything that contradicts that. I was saying that, with games likely needing to be coded specifically to take advantage of dual-issue FP32, there is a lot of spare floating-point capacity on Ampere for FP-based GPU decompression without impacting rendering in games. This would be similar to the PS4 Pro's double-rate FP16: games needed to be coded specifically for it, and most of the time it just sat there doing nothing.

Regards,
SB
 
How does all of this tie into it being inferior to fixed-function hardware?
Ask yourself some questions: why do GPUs store compressed textures in their caches? Why do GPUs have dedicated hardware for decoding compressed textures? Why do GPUs have dedicated hardware for MIP level evaluation?

Do you see a trend here?

If you're going to assert that non-dedicated solutions are equivalent, let alone superior, then you really need to bring actual arguments. I'm just providing some checkboxes for you guys who want to go off and do some research.

In the meantime, Direct Storage looks cool. In a couple of years' time PC games will be doing sweet things with this. The API itself is not a quick fix for games that are about to launch.
 
As to whether the tensor cores would be useful for it, no idea. It's just a hypothesis as to why it was limited to the 2 GPUs which had tensor cores. But they are at heart just FP16 number crunching machines. It's likely that you couldn't "only" use the tensor cores, but it's also possible that the tensor cores could be used to assist in GPGPU decompression.
I suspect the integer throughput of tensor ALUs is the key for decompression. Floating point math, per se, tends to discard too many bits to be of much use in decompression!
 
Ask yourself some questions: why do GPUs store compressed textures in their caches? Why do GPUs have dedicated hardware for decoding compressed textures? Why do GPUs have dedicated hardware for MIP level evaluation?

Do you see a trend here?

If you're going to assert that non-dedicated solutions are equivalent, let alone superior, then you really need to bring actual arguments. I'm just providing some checkboxes for you guys who want to go off and do some research.

In the meantime, Direct Storage looks cool. In a couple of years' time PC games will be doing sweet things with this. The API itself is not a quick fix for games that are about to launch.
Right, thanks for pointing the way here.

According to nvidia here: https://www.nvidia.com/en-us/geforce/news/rtx-io-gpu-accelerated-storage-technology/
Specifically, NVIDIA RTX IO brings GPU-based lossless decompression, allowing reads through DirectStorage to remain compressed while being delivered to the GPU for decompression. This removes the load from the CPU, moving the data from storage to the GPU in its more efficient, compressed form, and improving I/O performance by a factor of 2.

GeForce RTX GPUs are capable of decompression performance beyond the limits of even Gen4 SSDs, offloading dozens of CPU cores’ worth of work to deliver maximum overall system performance for next generation games.

So it's not really clear whether RTX comes with hardware decompression (that they've not talked about), or whether it's just using its ALUs to do the decompression. I fully understand that compute could be entirely inferior to dedicated hardware, but there is so much of it that the end result is still very good. Does that mean the solution is not appropriate for PC? This seems like a question only Nvidia can answer: whether it's worth the silicon to build dedicated hardware for compression/decompression, or to just do it through the existing hardware via brute force.
 
It's not an area that I actively track, but there is a lot of research out there (including by Rad Game Tools, the creators of Kraken) WRT GPU-accelerated decompression. Just do a web search on GPGPU decompression and you'll get quite a few hits that you can peruse on your own.

The reason that PS4 and XBSX didn't use GPGPU accelerated decompression is that there are very strict restrictions placed on a console that don't exist on a PC. A console can also force universal inclusion of an esoteric hardware feature (like a hardware decompression block specific to graphics rendering).


I did look for GPGPU decompression, and other than that LZW poster presentation (which uses a compression format that apparently isn't used in games because it's so slow compared to zlib), I have yet to find a solution that is focused on higher performance rather than on offloading the CPU for other tasks.
id's MegaTexture, for example, is single-threaded. It does get GPU compute assistance on some parts of the pipeline, but it's still just one CPU core handling the task.
I do welcome other data points if you bring them to the discussion, though.

From what I can tell though, data decompression done on a general-purpose processor (GPU or CPU) seems to depend mostly on single-threaded performance, even if there are ways to make some parts parallel.
Just like with e.g. JavaScript: you could spin up dozens of Web Workers to improve performance, but it only gets you so far.
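To show what that single-threaded dependence looks like in practice, here's a toy host-side sketch (a hypothetical LZ77-ish token format, not any real codec): every back-reference reads output that was produced just before it, which is exactly the kind of chain that doesn't spread across thousands of GPU threads unless the stream is split into independently decodable chunks.

```cuda
#include <cstdio>
#include <cstdint>
#include <vector>

// Toy, hypothetical LZ77-style token stream (not any shipping codec).
// A token is either a literal byte or a back-reference into the output.
struct Token {
    uint16_t offset;   // 0 => literal; otherwise distance back into the output
    uint16_t length;   // match length when offset != 0
    uint8_t  literal;
};

// The decode loop is inherently serial: a match can copy bytes that were
// written only moments earlier (it may even overlap the bytes being written),
// so output position N depends on earlier output positions and the iterations
// cannot simply be handed to thousands of independent threads.
std::vector<uint8_t> decode(const std::vector<Token>& tokens)
{
    std::vector<uint8_t> out;
    for (const Token& t : tokens) {
        if (t.offset == 0) {
            out.push_back(t.literal);
        } else {
            size_t src = out.size() - t.offset;
            for (uint16_t i = 0; i < t.length; ++i)
                out.push_back(out[src + i]);   // reads what we just wrote
        }
    }
    return out;
}

int main()
{
    // "abc" followed by "copy 6 bytes starting 3 back" -> "abcabcabc".
    std::vector<Token> toks = {
        {0, 0, 'a'}, {0, 0, 'b'}, {0, 0, 'c'},
        {3, 6, 0}
    };
    std::vector<uint8_t> out = decode(toks);
    printf("%.*s\n", (int)out.size(), (const char*)out.data());
    return 0;
}
```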


As to whether the tensor cores would be useful for it, no idea. It's just a hypothesis as to why it was limited to the 2 GPUs which had tensor cores.
Could be just product feature partitioning, i.e. there'd be no reason why it wouldn't work on non-RTX Turing and even Pascal, other than it not being in Nvidia's best interest to do so.
 
As to whether the tensor cores would be useful for it, no idea. It's just a hypothesis as to why it was limited to the 2 GPUs which had tensor cores. But they are at heart just FP16 number crunching machines. It's likely that you couldn't "only" use the tensor cores, but it's also possible that the tensor cores could be used to assist in GPGPU decompression.
https://www.nvidia.com/content/dam/...ure/NVIDIA-Turing-Architecture-Whitepaper.pdf
According to the whitepaper, their tensor cores support:
INT8 and INT4 modes
FP16 ops with FP16 or FP32 accumulate (this is the main use case for tensor cores in NNs)

Each tensor core performs a 4x4 matrix multiply-and-accumulate in one clock cycle, but only in the two modes listed above. There are 8 tensor cores within each SM.
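For reference, this is roughly how those matrix multiply-accumulate ops are exposed from CUDA through the public WMMA API (a minimal sketch, FP16 inputs with FP32 accumulation; the API works on warp-level 16x16x16 tiles built from those per-clock 4x4 ops; kernel name and test values are made up, and it needs sm_70 or newer):

```cuda
#include <cstdio>
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes a single 16x16 output tile: D = A*B + C on the tensor cores.
__global__ void wmma_16x16x16(const half* A, const half* B, float* C)
{
    // Per-warp fragments: A and B hold FP16 data, the accumulator is FP32.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, A, 16);   // leading dimension = 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}

int main()
{
    half* A; half* B; float* C;
    cudaMallocManaged(&A, 256 * sizeof(half));
    cudaMallocManaged(&B, 256 * sizeof(half));
    cudaMallocManaged(&C, 256 * sizeof(float));
    for (int i = 0; i < 256; ++i) { A[i] = __float2half(1.0f); B[i] = __float2half(1.0f); }

    wmma_16x16x16<<<1, 32>>>(A, B, C);   // one warp handles the whole tile
    cudaDeviceSynchronize();
    printf("C[0] = %f\n", C[0]);         // 16.0: each element sums 16 products of 1*1

    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```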
 
Another comparison point: AMD and NVidia GPUs both use delta colour compression. There have been a few generations of this technology and it all appears to be based upon dedicated hardware.

I don't know how much is known about the inner workings of these things. Some of this functionality would appear to be fairly similar to what's required for bulk game texture decompression, but I have no useful insights on this.
 
Yup. I recently got a motherboard with PCI Express 4.0 and an NVMe SSD just for the speed. Now it seems it might not be enough...

I don't see why it wouldn't be. The only specific requirement we've been told about is an NVMe SSD. If some kind of specific platform hardware support is required, then I can't imagine that a Zen 2 based platform (which I assume you must have), which already supports P2P DMA between any PCIe devices, wouldn't suffice.
 
Another comparison point: AMD and NVidia GPUs both use delta colour compression. There have been a few generations of this technology and it all appears to be based upon dedicated hardware.

I don't know how much is known about the inner workings of these things. Some of this functionality would appear to be fairly similar to what's required for bulk game texture decompression, but I have no useful insights on this.

We already know texture decompression is possible in software; it's been happening for years on the PC, and even on last-gen consoles which feature hardware decompressors, developers sometimes chose to bypass those in favour of CPU-based decompression - on a Jaguar! I'm not sure why it would now be so difficult to do this on programmable shaders with orders of magnitude more math throughput than a dozen Zen 2 CPU cores, especially when Nvidia have very explicitly said it is possible.

Is it more efficient than dedicated hardware? Almost certainly not. But could it be equally or more capable than what is likely quite a cheap hardware unit? I see no evidence to suggest it wouldn't be, and I see Nvidia claiming very specifically that it will be.
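For what it's worth, the way this sort of brute force is usually structured on a GPU is to split the compressed data into independently decodable chunks, so the parallelism comes from chunk count rather than from inside any single stream. A minimal sketch (a trivial run-length scheme stands in for a real codec here; the struct layout and names are made up):

```cuda
#include <cstdio>
#include <cstdint>

// Offsets describing one independently compressed chunk.
struct Chunk {
    uint32_t in_offset;   // where this chunk's compressed bytes start
    uint32_t in_size;     // compressed size in bytes (pairs of [count, value])
    uint32_t out_offset;  // where this chunk's decompressed bytes go
};

// One thread per chunk. Inside a chunk the decode is still serial, but
// thousands of chunks decode concurrently across the GPU's ALUs.
__global__ void decode_rle_chunks(const uint8_t* in, uint8_t* out,
                                  const Chunk* chunks, int num_chunks)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= num_chunks) return;

    const uint8_t* src = in + chunks[c].in_offset;
    uint8_t* dst = out + chunks[c].out_offset;

    uint32_t w = 0;
    for (uint32_t r = 0; r + 1 < chunks[c].in_size; r += 2) {
        uint8_t count = src[r];
        uint8_t value = src[r + 1];
        for (uint8_t i = 0; i < count; ++i) dst[w++] = value;
    }
}

int main()
{
    // Two chunks: [3 x 'A'] and [2 x 'B', 4 x 'C'] -> "AAABBCCCC".
    uint8_t h_in[] = { 3, 'A', 2, 'B', 4, 'C' };
    Chunk h_chunks[] = { {0, 2, 0}, {2, 4, 3} };

    uint8_t *d_in, *d_out; Chunk* d_chunks;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, 16);
    cudaMalloc(&d_chunks, sizeof(h_chunks));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    cudaMemcpy(d_chunks, h_chunks, sizeof(h_chunks), cudaMemcpyHostToDevice);

    decode_rle_chunks<<<1, 2>>>(d_in, d_out, d_chunks, 2);

    uint8_t h_out[16] = {};
    cudaMemcpy(h_out, d_out, 16, cudaMemcpyDeviceToHost);
    printf("%.9s\n", (const char*)h_out);

    cudaFree(d_in); cudaFree(d_out); cudaFree(d_chunks);
    return 0;
}
```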

Also, for fans of Rad Game Tools (the Kraken developers), don't forget they have a very new GPU compression technology called BC7Prep which is decompressed at run time... on the GPU.
 
We already know texture decompression is possible in software; it's been happening for years on the PC, and even on last-gen consoles which feature hardware decompressors, developers sometimes chose to bypass those in favour of CPU-based decompression - on a Jaguar!
[...]
Also, for fans of Rad Game Tools (the Kraken developers), don't forget they have a very new GPU compression technology called BC7Prep which is decompressed at run time... on the GPU.
What is the format of the output data produced by these software routines?
 
Also, for fans of Rad Game Tools (the Kraken developers), don't forget they have a very new GPU compression technology called BC7Prep which is decompressed at run time... on the GPU.

BC7Prep is a pre-processing (offline) software tool for BC7 blocks that slightly increases the compression ratio while maintaining compatibility with the format.
It has nothing to do with GPU decompression; it only says that the runtime reversal of its transform can be done on the GPU:

Oodle Texture also includes a lossless transform for BC7 blocks called "BC7Prep" that makes them more compressible. BC7Prep takes BC7 blocks that are often very difficult to compress and rearranges their bits, yielding 5-15% smaller files after subsequent compression. BC7Prep does require runtime reversal of the transform, which can be done on the GPU. BC7Prep can be used on existing BC7 encoded blocks, or for additional savings can be used with Oodle Texture RDO in near lossless mode. This allows significant size reduction on textures where maximum quality is necessary.
It's just the lossless pre-processing mode for the Oodle Texture Compression suite.
There's nothing in Oodle Texture Compression that refers to decompression through GPU Compute.

http://www.radgametools.com/oodletexture.htm
 
I suspect the integer throughput of tensor ALUs is the key for decompression. Floating point math, per se, tends to discard too many bits to be of much use
I agree with this; it would explain the restriction to RTX GPUs that are equipped with tensor cores. A confirmation of this could come if the Titan V and Quadro V100 are supported as well.
 
BC7Prep is a pre-processing (offline) software tool for BC7 blocks that slightly increases the compression ratio while maintaining compatibility with the format.
It has nothing to do with GPU decompression; it only says that the runtime reversal of its transform can be done on the GPU:


It's just the lossless pre-processing mode for the Oodle Texture Compression suite.
There's nothing in Oodle Texture Compression that refers to decompression through GPU Compute.

http://www.radgametools.com/oodletexture.htm

Except for Fabian Giesen's words.

[attached screenshot]

 
Do you know what disk fragmentation is
Yes, but that's not a problem on SSDs, where the sectors don't need to be next to each other, because accessing a cluster takes the same time no matter where it's located, unlike on mechanical HDDs.
RAM should be the same.
 
Yes, but that's not a problem on SSDs, where the sectors don't need to be next to each other, because accessing a cluster takes the same time no matter where it's located, unlike on mechanical HDDs.
RAM should be the same.

https://www.eurogamer.net/articles/digitalfoundry-inside-killzone-shadow-fall

Memory fragmentation does exist:

Also fascinating is that by its own admission Guerrilla is not using much of the Compute functionality of the PS4 graphics core - in its conclusion to the presentation it says that there's only one Compute job in the demo, and that's used for memory defragmentation. Factoring in how much Sony championed the technology in its PS4 reveal, it's an interesting state of affairs and perhaps demonstrates just how far we have to go in getting the most out of this technology - despite its many similarities with existing PC hardware.
 
I wonder what portion caused the fragmentation. I thought every engine would have its own memory chunking setup (a "memory carver"), where you carve off larger blocks of memory to be used for specific items. So instead of allocating memory for one item, you allocate a pool of 100 or 1000 all at once and cycle through the unused indexes. It reminds me of old-school C programming, when you had to manage the heap yourself, but then everyone learned to do things in larger chunks to manage fragmentation.

Maybe GPU resources don't lend themselves to that?
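For reference, a host-side sketch of that pool/"memory carver" pattern (type names here are made up; whether GPU resource lifetimes map onto fixed-size slots this cleanly is exactly the open question):

```cuda
#include <cstddef>
#include <vector>

// One upfront allocation, a free list of slot indexes, and no per-object
// allocation at runtime -- so the pool itself never fragments.
template <typename T>
class FixedPool {
public:
    explicit FixedPool(size_t count) : storage_(count)
    {
        free_list_.reserve(count);
        for (size_t i = 0; i < count; ++i)
            free_list_.push_back(count - 1 - i);   // all slots start out free
    }

    // Hand out an unused slot, or nullptr when the pool is exhausted.
    T* acquire()
    {
        if (free_list_.empty()) return nullptr;
        size_t slot = free_list_.back();
        free_list_.pop_back();
        return &storage_[slot];
    }

    // Return the slot's index to the free list; nothing is actually freed.
    void release(T* p)
    {
        free_list_.push_back(static_cast<size_t>(p - storage_.data()));
    }

private:
    std::vector<T>      storage_;    // the one big carved-off block
    std::vector<size_t> free_list_;  // indexes of currently unused slots
};

struct Particle { float x, y, z; };

int main()
{
    FixedPool<Particle> pool(1000);   // carve 1000 slots up front
    Particle* p = pool.acquire();
    p->x = 1.0f;
    pool.release(p);                  // slot is recycled, not freed
    return 0;
}
```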
 