DirectStorage GPU Decompression, RTX IO, Smart Access Storage

There's literally no way of knowing, and no one should be pointing the finger of blame without any factual basis, like some world leader. Maybe MS created an API that can't work with nVidia and it's on MS? Maybe the API is great and nVidia just didn't implement it right? Maybe the API was following guidance from the IHVs, but it turns out there was an issue overlooked by both sides and now they have to find some workaround?

Anyone stoically voicing whose fault it is without any insight whatsoever is talking faith, not science/engineering.

Please elevate the discussion to possible reasons why one GPU vendor is performing worse, ideally by comparing architectures and software stacks to identify the differences. If that's not possible because the technical understanding is lacking, just leave it as it is until we know more actual facts.
 
Please elevate the discussion to possible reasons why one GPU vendor is performing worse, ideally by comparing architectures and software stacks to identify the differences.
It's surprisingly simple. If you start using the copy engine for transfers, you quickly approach full saturation of the PCIe bus. At that point it boils down to how far you've got your priorities straight, specifically whether you can still manage to push command buffers to the other engines from the CPU even though the copy engine keeps pulling in data from the other direction.

If you didn't account for a smart back-off of the copy engine, then your latency goes south.

About 4 years or so back, NVidia already slightly dialed back the burst size of the copy engine because it would get to the point where it would cause outright DPC watchdog violations when used in SLI setups. (Simplified story - it actually also involved contention on a PCIe switch with an unfair balancing bias. But the root cause was the copy engines being able to put the PCIe bus under so much load, without any gaps, that you could starve entire PCIe devices on the same PCIe switch from being reachable for seconds at a time.)

I somewhat doubt that NVidia has already implemented the necessary PCIe protocol features to support out-of-band priority messages for submission of the command buffers, compared to device-initiated transfers run by the copy engine. Or rather, whether they got the priorities straight (on the GPU side) between transfers initiated by the copy engine and transfers that don't originate from it (i.e. uploads of all other resources needed for processing the command buffers). And both of those only address the GPU stalling itself - not the aforementioned impact on other PCIe devices.
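To make the software-visible side of this concrete, here is a minimal D3D12 sketch of my own (not taken from DirectStorage or any driver): bulk uploads go onto a dedicated COPY queue, which maps to the hardware copy engine, while draw submissions stay on the DIRECT queue. The priority values are only hints to the OS/driver scheduler; whether they influence arbitration at the PCIe level is exactly the open question above.

```cpp
// Minimal sketch, assuming an existing ID3D12Device* 'device'; error handling omitted.
// A dedicated COPY queue feeds the copy engine with bulk uploads, while the DIRECT
// queue keeps receiving command buffers for rendering. The priority enum is only a
// scheduling hint and guarantees nothing about PCIe arbitration between the engines.
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

void CreateQueues(ID3D12Device* device,
                  ComPtr<ID3D12CommandQueue>& directQueue,
                  ComPtr<ID3D12CommandQueue>& copyQueue)
{
    D3D12_COMMAND_QUEUE_DESC directDesc = {};
    directDesc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;
    directDesc.Priority = D3D12_COMMAND_QUEUE_PRIORITY_HIGH;  // try to keep draw submissions flowing
    device->CreateCommandQueue(&directDesc, IID_PPV_ARGS(&directQueue));

    D3D12_COMMAND_QUEUE_DESC copyDesc = {};
    copyDesc.Type = D3D12_COMMAND_LIST_TYPE_COPY;             // backed by the copy engine
    copyDesc.Priority = D3D12_COMMAND_QUEUE_PRIORITY_NORMAL;
    device->CreateCommandQueue(&copyDesc, IID_PPV_ARGS(&copyQueue));
}
```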
 
One thing I noticed about DirectStorage: with Acronis anti-ransomware installed, it was prevented from functioning even though everything said it was enabled. I didn't find that out until the Xbox app told me.
 
RTX Neural Texture Compression is tested, showing promising results with massive VRAM and disk space savings.

At 1440p/TAA on a 4090, there are no performance penalties. When DLSS is used, there is a perf hit.


When the "performance loss" @ 4K is going from 1116 FPS to 951 fps, I have a hard time taking it serious as we are talking synthetic values, not real world performanace.
 
RTX Neural Texture Compression is tested, showing promising results with massive VRAM and disk space savings.

At 1440p/TAA on a 4090, there are no performance penalties. When DLSS is used, there is a perf hit.


This will be an interesting one to compare on Blackwell, to see if there is as much performance loss when using the NTC modes. Really curious to see if the scheduling improvements for neural shaders on Blackwell eliminate some of that loss.
 
This will be an interesting one to compare on Blackwell, to see if there is as much performance loss when using the NTC modes. Really curious to see if the scheduling improvements for neural shaders on Blackwell eliminate some of that loss.
Far more interesting will be to see how well this actually scales to the non-halo models that don't have as much cache size and excess memory bandwidth available. The performance cost from the additional stress on the memory system can be expected to be constant across all render resolutions (unlike with mipmapped textures, where it does scale), and should impact the smaller chips disproportionately. Only the arithmetic portion still scales with native render resolution, but that shouldn't be the limiting factor with those neural textures.
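A rough way to formalise that claim (my shorthand, not from the demo): per frame, the NTC cost splits into a memory-system term dominated by the per-texture weight/latent traffic and cache pressure, plus an arithmetic term that scales with the number of shaded pixels:

```latex
t_{\mathrm{NTC}} \;\approx\; t_{\mathrm{mem}} \;+\; c_{\mathrm{ALU}} \cdot N_{\mathrm{pixels}}
```

If t_mem is roughly resolution-independent, as argued above, smaller chips are hit hardest: dropping the render resolution shrinks only the arithmetic term, while the memory term stays the same and there is less cache and bandwidth to absorb it.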
 
Curious observation on the side - the Windows IORing API appears to under-perform on ReFS file systems. The performance difference to the Overlapped file API essentially flips around, from being up to 15% faster on NTFS to actually being 10-15% slower. But still with a lot less CPU usage - IORing delivers 80%+ of the Overlapped API's peak throughput with just one partially loaded CPU core, whereas the Overlapped API is running completion handlers on multiple cores...

That's relevant if you have set up a "Dev Drive" (ReFS file system, reduced filter driver set, async AV) as suggested by Microsoft for your development workflow, and thus placed your assets on a ReFS-formatted volume rather than an NTFS-formatted volume.

Edit: Apparently ReFS lacks support for the block-level "Direct Mode" which IORing builds upon when backed by NTFS. An old limitation that was documented for Windows Server back in 2020 and still not resolved. So with ReFS, "DirectStorage" will NOT bypass the file system. Remarkable that it's still almost keeping up with native SSD speeds despite that.
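For context, a minimal sketch of what a single IORing read looks like (ioringapi.h, Windows 11). The file name, queue sizes and read size are placeholders of mine, not from any benchmark above, and error handling is trimmed; whether the request goes through the block-level "Direct Mode" or falls back to the regular file system path (as on ReFS) is decided below this API, not by the caller.

```cpp
#include <windows.h>
#include <ioringapi.h>
#include <vector>

int main()
{
    HIORING ring = nullptr;
    // One submission/completion queue pair; 64/128 entries are arbitrary sizes.
    if (FAILED(CreateIoRing(IORING_VERSION_3, {}, 64, 128, &ring)))
        return 1;

    HANDLE file = CreateFileW(L"asset.bin", GENERIC_READ, FILE_SHARE_READ, nullptr,
                              OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (file == INVALID_HANDLE_VALUE)
        return 1;

    std::vector<char> buffer(64 * 1024);

    // Queue a single 64 KiB read at offset 0; userData tags the completion.
    BuildIoRingReadFile(ring,
                        IoRingHandleRefFromHandle(file),
                        IoRingBufferRefFromPointer(buffer.data()),
                        static_cast<UINT32>(buffer.size()),
                        /*fileOffset*/ 0,
                        /*userData*/ 0,
                        IOSQE_FLAGS_NONE);

    // Submit and block until the one completion arrives.
    UINT32 submitted = 0;
    SubmitIoRing(ring, /*waitOperations*/ 1, /*milliseconds*/ INFINITE, &submitted);

    IORING_CQE cqe = {};
    PopIoRingCompletion(ring, &cqe); // cqe.ResultCode / cqe.Information describe the result

    CloseHandle(file);
    CloseIoRing(ring);
    return 0;
}
```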
 
When the "performance loss" @ 4K is going from 1116 FPS to 951 fps, I have a hard time taking it serious as we are talking synthetic values, not real world performanace.
Yeah it is more interesting to look at the milliseconds value ->
4K TAA + NTC Inference = 1535 fps = 0.65 millisecond render time
4K DLSS + NTC Inference = 968 fps = 1.03 millisecond render time

So loading up the Tensor cores more here has a roughly 0.40 millisecond render time penalty. DLSS is slower than native + TAA as a result. I imagine though that in a real game the increased render time would be dramatically offset by the performance won back from rendering at a lower resolution with DLSS. The sample's visuals are too simple to show that.
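For reference, those frame times follow directly from the reported frame rates:

```latex
t_{\text{frame}} = \frac{1000}{\text{fps}}\ \text{ms}:\qquad
\frac{1000}{1535} \approx 0.65\ \text{ms},\qquad
\frac{1000}{968} \approx 1.03\ \text{ms},\qquad
\Delta t \approx 0.38\ \text{ms}
```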
 
So loading up the Tensor cores more here has a roughly 0.40 millisecond render time penalty.
And that's with only ~7 NTC textures used for the scene! Each tensor has a 1.5 MB footprint, so e.g. a 4060 could use fewer than 16 NTC textures in the same batch of draw calls before the L2 cache spills. The L1 can't even hold a single tensor, so this burns L2 cache bandwidth like crazy, and you can bet this would never be replicable with a slightly more modern title that uses a higher number of different assets in the same scene.

An 85% reduction in VRAM footprint traded for a +30000% increase in cache footprint.

That demo was tailored to just barely fit textures and geometry into the L2 cache of an RTX 4060.
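The arithmetic behind that, assuming the RTX 4060's 24 MB of L2 and the 1.5 MB per-tensor footprint quoted above:

```latex
7 \times 1.5\ \text{MB} \approx 10.5\ \text{MB (this demo's scene)},\qquad
16 \times 1.5\ \text{MB} = 24\ \text{MB} \approx \text{the full L2 of an RTX 4060}
```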
 
Far more interesting will be to see how well this actually scales to the non-halo models that don't have as much cache size and excess memory bandwidth available. The performance cost from the additional stress on the memory system can be expected to be constant across all render resolutions (unlike with mipmapped textures, where it does scale), and should impact the smaller chips disproportionately. Only the arithmetic portion still scales with native render resolution, but that shouldn't be the limiting factor with those neural textures.
Their sampling method still uses mip-mapping to avoid thrashing the cache, right? At least in their stochastic sampling paper they explicitly mention that and call it their 'hybrid' approach, where they choose a lower mip-map level than normal as a balance between quality and performance.
 
Their sampling method still uses mip-mapping to avoid thrashing the cache, right? At least in their stochastic sampling paper they explicitly mention that and call it their 'hybrid' approach, where they choose a lower mip-map level than normal as a balance between quality and performance.
I don't think they actually did for their most recent demo. They are now differentiating between an "NTC on load" and an "NTC on sample" mode. "NTC on load" appears to be creating a BC7-compressed texture (albeit likely with severe artifacts; BC7 is hard to do right under time constraints). "NTC on sample" - at least in this demo - doesn't appear to be using mip-maps at all, going by the reported VRAM usage, which exactly matches the tensor size.

I expect that "hybrid" approach had failed because you rarely have the comfort that your UV mapping is so perfectly uniform that you can avoid to have to use the upper LOD levels entirely reliably. The caches are trashed even when you need to query the NTC for just a single fragment.

EDIT: That naive "hybrid" approach of serving the high LOD from NTC and the low from mip-maps will most likely yet again be superseded by a virtual texturing approach that can populate the virtualized texture from NTC. But we are not yet at the point of having a demo for that. In fact, that might be the better alternative to "NTC on load".

EDIT2: Just remembered the issue with the hybrid approach. It was mentioned that the NTC would encode viewing-angle-dependent material behavior that was outside of the texture-based PBR material definition. So there's potentially a significant mismatch between the mip-map-based material and the NTC-based material outside of the ideal orthogonal projection used to populate the mip-maps.
 
I don't think they actually did for their most recent demo. They are now differentiating between an "NTC on load" and an "NTC on sample" mode. "NTC on load" appears to be creating a BC7-compressed texture (albeit likely with severe artifacts; BC7 is hard to do right under time constraints). "NTC on sample" - at least in this demo - doesn't appear to be using mip-maps at all, going by the reported VRAM usage, which exactly matches the tensor size.

I expect that "hybrid" approach had failed because you rarely have the comfort that your UV mapping is so perfectly uniform that you can avoid to have to use the upper LOD levels entirely reliably. The caches are trashed even when you need to query the NTC for just a single fragment.

Edit: That naive "hybrid" approach of serving the high LOD from NTC and the low from mip-maps will most likely yet again be superseded by a virtual texturing approach that can populate the virtualized texture from NTC. But we are not yet at the point of having a demo for that.

EDIT2: Just remembered the issue with the hybrid approach. It was mentioned that the NTC would encode material behavior that was outside of the texture-based PBR material definition. So there's potentially a significant mismatch between the mip-map-based material and the NTC-based material outside of the ideal orthogonal projection used to populate the mip-maps.
In their paper they say that the mipmap information only accounts for ~6.7% or less (instead of 33% as with normal mipmap generation), because they can share the feature vectors to predict multiple mip-map levels, so it could be that you will not notice the difference in VRAM usage with it on or off.
I was curious and looked in the source. With STF they set the sample mode to STF_ANISO_LOD_METHOD_DEFAULT (you can see this here or here), which in the STF library does compute a mipmap LOD based on UV gradients.
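For reference, the 33% figure is just the usual storage overhead of a full 2D mip chain, each level being a quarter of the previous one:

```latex
\sum_{k=1}^{\infty} \frac{1}{4^{k}} \;=\; \frac{1}{3} \;\approx\; 33\%
```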
 
With STF they set the sample mode to STF_ANISO_LOD_METHOD_DEFAULT (you can see this here or here), which in the STF library does compute a mipmap LOD based on UV gradients.
NtcForwardShadingPass.hlsl does not actually use the sampler object to sample any texture, only for estimating the sample position, which is then always used as input to the network (https://github.com/NVIDIA-RTX/RTXNT...3/include/libntc/shaders/Inference.hlsli#L584).

LegacyForwardShadingPass.hlsl is the path for "NTC on load" and "classic texture" and is exclusively texture based.

STF_ANISO_LOD_METHOD_DEFAULT is just used to select the correct mip level when anisotropic filtering would be required. But I can't even tell if it has any effect at all in the non-legacy shader; it's just passed as an input to the tensor.

There is no hybrid approach in either of those shaders.
 
The sampler object is not a sampler but just a struct. The function itself outputs a mipLevel param.
NtcForwardShadingPass.hlsl does not actually use the sampler object to sample any texture, only for estimating the sample position, which is then always used as input to the tensor. LegacyForwardShadingPass.hlsl is the path for "NTC on load" and "classic texture" and is exclusively texture based.

STF_ANISO_LOD_METHOD_DEFAULT is just used to select the correct mip level when anisotropic filtering would be required. But I can't even tell if it's used at all in the non-legacy shader; it's just passed as an input to the tensor.

There is no hybrid approach in either of those shaders.
sampler.Texture2DGetSamplePos returns the mip level in the .z parameter. The GetSamplePositionWithSTF function outputs the mipLevel param, and that is later fed into the NtcSampleTextureSet_CoopVec_ methods to sample the correct mipmap level.
 
that is later fed into the NtcSampleTextureSet_CoopVec_ methods to sample the correct mipmap level.
That's slightly more complicated. The mip level is used by NtcSampleTextureSet_CoopVec_ to select a couple of additional input arguments to the neural network ("latent features" for a given mip level, from a much lower-resolution mip-map-like structure) from an additional buffer. But it still doesn't switch back to texture sampling; it's always running the full network regardless of mip level. I guess that additional buffer accounts for those +6.7% of "mip-map" from the paper.

It doesn't change that the real cost of NTC is that huge unique weight buffer per texture, and that is needed for every sample regardless of mip-level.
 
The tweet could have done a better job of expressing this. These posts show noise from the lack of texture mip filtering, and the sampling for mips uses dithering. This now introduces noise to textures, where textures to date have had none. As pointed out in some comments, we are taking scenes that are already noisy from low sampling and now adding noise to the textures too. There won't be a stable pixel on the screen!

The question is how to get around this. Can mip levels be introduced to the encoding?
 