Velocity Architecture - Limited only by asset install sizes

I mean, I guess I'm just thinking about space fragmentation in memory: you've got 200MB left in each pool, but the next set of assets requires 250MB, for instance. Neither pool can hold it, and now you're playing Tetris to make things work. It's a slightly different argument from bandwidth, and this problem doesn't exist with a unified setup. So if you need to make space, the question is how, and what performance impacts the system will suffer for making space. Or you can forgo it and try to load more things just in time (which brings us back to the argument around why the Velocity Architecture is needed).
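
To make the Tetris point concrete, here's a toy sketch (the pool sizes and the allocator itself are made up purely for illustration, not how the actual memory manager works). Same 400MB of total headroom in both cases; only the split layout fails:

```cpp
#include <cstdint>
#include <cstdio>

// Toy model: an allocation must fit entirely within one pool.
struct Pool {
    uint64_t capacity, used;
    uint64_t freeBytes() const { return capacity - used; }
};

bool allocSplit(Pool& a, Pool& b, uint64_t size) {
    if (a.freeBytes() >= size) { a.used += size; return true; }
    if (b.freeBytes() >= size) { b.used += size; return true; }
    return false; // 200MB free in each pool, but 250MB fits in neither
}

bool allocUnified(Pool& u, uint64_t size) {
    if (u.freeBytes() >= size) { u.used += size; return true; }
    return false; // the same 400MB of total headroom satisfies the request
}

int main() {
    const uint64_t MB = 1024ull * 1024;
    Pool fast    {10240 * MB, 10040 * MB}; // 200MB left in the "GPU-optimal" pool
    Pool slow    { 6144 * MB,  5944 * MB}; // 200MB left in the "standard" pool
    Pool unified {16384 * MB, 15984 * MB}; // 400MB left in one pool
    std::printf("split:   %s\n", allocSplit(fast, slow, 250 * MB) ? "ok" : "fail");
    std::printf("unified: %s\n", allocUnified(unified, 250 * MB) ? "ok" : "fail");
}
```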

IIRC HBCC would have handled this issue really well, or whatever DMA controllers they're using in the Series X and PS5. If you think about it as a virtual address space and not the physical address space, the OS should be able to load the 250MB into the right physical memory from disk I/O. The displaced 200MB would simply be retrieved when needed from its physical location on the SSD, but it would remain present in the whole virtual RAM. Unless I misunderstood what you were saying.
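
Roughly the mental model I have, as a sketch (every name here is invented; this is generic demand paging, not the consoles' real DMA or HBCC interfaces):

```cpp
#include <cstdint>
#include <unordered_map>

// Conceptual sketch of demand paging only.
enum class Backing { Ram, Ssd };

struct PageEntry {
    Backing  where;    // which physical tier currently holds the page
    uint64_t physAddr; // location within that tier
};

// Virtual page number -> current physical placement.
std::unordered_map<uint64_t, PageEntry> pageTable;

// Stand-ins for the pager / DMA engine.
uint64_t evictColdPageToSsd()                { return 0; } // stub: frees a RAM page
void     dmaCopySsdToRam(uint64_t, uint64_t) {}            // stub: DMA transfer, no CPU copy

// On access, a page backed by the SSD is pulled into RAM; whatever it
// displaces stays addressable, it just migrates out to the SSD.
uint64_t touchPage(uint64_t virtualPage) {
    PageEntry& e = pageTable[virtualPage];
    if (e.where == Backing::Ssd) {
        uint64_t ram = evictColdPageToSsd(); // make room (the displaced 200MB)
        dmaCopySsdToRam(e.physAddr, ram);    // hardware moves it, not the CPU
        e = { Backing::Ram, ram };
    }
    return e.physAddr; // the caller only ever sees one flat virtual space
}
```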
 
Maybe this is not fully optimized. Loading is slow in Hitman 3, around 6 to 7 seconds, compared to Spider-Man: Miles Morales, for example, where it is around 2 seconds. In an optimized title on PS5, the CPU is not involved in I/O at all.

If you assume Hitman 3 is not optimized for the PS5, you'd have to assume the same for the Series X. As far as we know, both games were optimized for their systems to the best of IO Interactive's ability. And as far as we're concerned, we haven't seen, nor will we ever see, how the Velocity Architecture loads Spider-Man. I still think the actual SSD is faster in the PS5, but the Series X is definitely performing just as well or even better despite having half the effective throughput.

It's honestly impressive thus far. My thoughts are that later on, when engines are more optimized, we'll start seeing a 2x advantage, but at very low load times. Say 2 vs 4 seconds. But in games with such low load times, like FIFA and NBA, it's been a draw, with games loading in under 3 seconds on both machines.
 
I mean, I guess I'm just thinking about space fragmentation in memory: you've got 200MB left in each pool, but the next set of assets requires 250MB, for instance. Neither pool can hold it, and now you're playing Tetris to make things work. It's a slightly different argument from bandwidth, and this problem doesn't exist with a unified setup. So if you need to make space, the question is how, and what performance impacts the system will suffer for making space. Or you can forgo it and try to load more things just in time (which brings us back to the argument around why the Velocity Architecture is needed).
When you allocate memory you are unlikely to do so at such a coarse granularity. Unless you absolutely need a whole 10GB of content to be ready at a given time, this is probably a non-issue.
 
When you allocate memory you are unlikely to do so at such a coarse granularity. Unless you absolutely need a whole 10GB of content to be ready at a given time, this is probably a non-issue.
Yeah, I think this is true. I was really just constructing a simple example to understand it.
But I would hope the granularity is small enough to make this issue a non-factor.
 
IIRC HBCC would have handled this issue really well, or whatever DMA controllers they're using in the Series X and PS5. If you think about it as a virtual address space and not the physical address space, the OS should be able to load the 250MB into the right physical memory from disk I/O. The displaced 200MB would simply be retrieved when needed from its physical location on the SSD, but it would remain present in the whole virtual RAM. Unless I misunderstood what you were saying.
I dunno, maybe? Unfortunately I don't understand enough to really know what's happening behind the scenes; the documentation isn't exhaustive.
 
I dunno, maybe? Unfortunately I don't understand enough to really know what's happening behind the scenes; the documentation isn't exhaustive.
Yeah HBCC or SFS would handle what you described really well. I wrote a post about it on reddit a while ago. Efficient demand paging of data into RAM. Do you have a link to the documentation?
 
If you assume Hitman 3 is not optimized for the PS5, you'd have to assume the same for the Series X. As far as we know, both games were optimized for their systems to the best of IO Interactive's ability. And as far as we're concerned, we haven't seen, nor will we ever see, how the Velocity Architecture loads Spider-Man. I still think the actual SSD is faster in the PS5, but the Series X is definitely performing just as well or even better despite having half the effective throughput.

It's honestly impressive thus far. My thoughts are that later on, when engines are more optimized, we'll start seeing a 2x advantage, but at very low load times. Say 2 vs 4 seconds. But in games with such low load times, like FIFA and NBA, it's been a draw, with games loading in under 3 seconds on both machines.

Half of the suspected effective throughput, at that; people really need to read through the FlashMap papers. I probably have them bookmarked, but I'm not searching my bookmarks for them right now xD.

Yeah HBCC or SFS would handle what you described really well. I wrote a post about it on reddit a while ago. Efficient demand paging of data into RAM. Do you have a link to the documentation?

I'm surprised there's no HBCC in either system, but I guess their designs did not require it. I wonder if MS's handling of the fast/slow memory pools is similar to how Intel has apps implement memory control with Optane Persistent Memory in App Direct mode, or in Memory Mode.

IIRC the latter just treats the DRAM as a last-level cache, and the system sees it and the Optane memory as one large contiguous memory pool, while App Direct mode has the software and OS treat them as two separate memory pools. I'm guessing, given the challenges some 3P devs seem to be having, MS could have the OS and game apps handling the fast and slow memory pools in an equivalent to the App Direct mode seen with Optane memory & DRAM.
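
The contrast I mean, sketched out (a made-up pseudo-API, only to show explicit-placement vs transparent-cache; nothing here is the actual Intel or MS interface):

```cpp
#include <cstddef>

// Invented sketch of the "App Direct"-style contract: the app, not the
// platform, decides which tier each allocation lives in.
struct Arena {
    unsigned char* base;
    size_t size, used;
};

void* arenaAlloc(Arena& a, size_t bytes) {
    if (a.used + bytes > a.size) return nullptr; // this tier is full
    void* p = a.base + a.used;
    a.used += bytes;
    return p;
}

enum class Tier { Fast, Slow }; // e.g. the 10GB GPU-optimal vs 6GB standard pool

// App Direct style: the caller names the tier explicitly...
void* allocExplicit(Arena& fast, Arena& slow, size_t bytes, Tier t) {
    return arenaAlloc(t == Tier::Fast ? fast : slow, bytes);
}

// ...whereas a Memory Mode-style system would expose a single allocator and
// decide placement transparently behind the scenes (DRAM as last-level cache).
```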
 
I'm surprised there's no HBCC in either system, but I guess their designs did not require it.
If I had to guess, the DMA controllers in the Series X do the equivalent of, or similar work to, HBCC. I think the SFS hardware in the GPU identifies what pages will be needed and the DMA controller makes sure they're resident in RAM. Otherwise all assets remain part of the virtual address space.


I wonder if MS's handling of the fast/slow memory pools is similar to how Intel has apps implement memory control with Optane Persistent Memory in App Direct mode, or in Memory Mode.

IIRC the latter just treats the DRAM as a last-level cache, and the system sees it and the Optane memory as one large contiguous memory pool, while App Direct mode has the software and OS treat them as two separate memory pools. I'm guessing, given the challenges some 3P devs seem to be having, MS could have the OS and game apps handling the fast and slow memory pools in an equivalent to the App Direct mode seen with Optane memory & DRAM.

I have no idea how this works.
 
If I had to guess, the DMA controllers in the Series X do the equivalent of, or similar work to, HBCC. I think the SFS hardware in the GPU identifies what pages will be needed and the DMA controller makes sure they're resident in RAM. Otherwise all assets remain part of the virtual address space.
I think the SF (without the Streaming) hardware only: 1) records which part & which mip level is sampled, and 2) makes sampling of the feedback map more accurate via a custom filter. The Streaming part, I would imagine, involves the CPU reading back the feedback map and encoding DirectStorage commands accordingly.
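
If that's how it works, the CPU-side loop might look vaguely like this (everything below is invented to show the shape of it, not the real D3D12/DirectStorage calls):

```cpp
#include <cstdint>
#include <vector>

// My guess at the shape of the CPU-side Streaming step.
struct FeedbackTexel {
    uint16_t regionX, regionY; // which region of the tiled texture
    uint8_t  desiredMip;       // finest mip the sampler wanted there
};

std::vector<FeedbackTexel> readBackFeedbackMap()     { return {}; }   // stub: GPU resolve + copy to CPU
uint8_t residentMip(uint16_t, uint16_t)              { return 0xFF; } // stub: read from the MinMip map
void    enqueueTileLoad(uint16_t, uint16_t, uint8_t) {}               // stub: wraps a DirectStorage request

void streamFromFeedback() {
    for (const FeedbackTexel& t : readBackFeedbackMap()) {
        // Lower mip index = finer detail; only request tiles the sampler
        // actually asked for that aren't yet resident.
        if (t.desiredMip < residentMip(t.regionX, t.regionY))
            enqueueTileLoad(t.regionX, t.regionY, t.desiredMip);
    }
}
```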
 
I think the SF (without the Streaming) hardware only: 1) records which part & which mip level is sampled, and 2) makes sampling of the feedback map more accurate via a custom filter. The Streaming part, I would imagine, involves the CPU reading back the feedback map and encoding DirectStorage commands accordingly.
Doesn't sound right to me.
SF is a DX12U feature;
SFS is an XS feature that includes the custom filter.
 
'Twas implied to be hardware-specific at one time (or at least tasks that could be done on the CPU, elsewhere).
I wonder if it's a feature that's in AMD hardware, but hasn't been uncovered in DX.
SF is supported in all DX12U hardware.
SFS is a set of custom additions made by MS for XS: features like the custom filter, etc.

Unsure how much performance SFS will give above and beyond SF; it may just help with simpler development.
But I suspect there will be benefits beyond simpler development, otherwise it may not have been worth the effort to add, as SF would need to be implemented in the PC engine anyway.
 
I think the SF (without the Streaming) hardware only: 1) records which part & which mip level is sampled, and 2) makes sampling of the feedback map more accurate via a custom filter. The Streaming part, I would imagine, involves the CPU reading back the feedback map and encoding DirectStorage commands accordingly.
Whatever page of data the custom filter identifies as not resident in RAM will have to be demand paged in. The DMA controllers would play a part in this process.
 
Jay wrote: "SF is supported in all DX12U hardware.
SFS is a set of custom additions made by MS for XS: features like the custom filter, etc."

Yep, I grabbed the wrong tweet initially.
I know a Sony engineer had mentioned writing a shader to query which textures are needed, and the sampler can report whether the texture is resident in memory. ('Twas a generic comment, but maybe some insight into what they have in hardware (or not) on the PS5 side.)
 
SF is supported in all DX12U hardware.
SFS is a set of custom additions made by MS for XS: features like the custom filter, etc.

Unsure how much performance SFS will give above and beyond SF; it may just help with simpler development.
But I suspect there will be benefits beyond simpler development, otherwise it may not have been worth the effort to add, as SF would need to be implemented in the PC engine anyway.

Read the MS doc I linked. There is no mention of the Xbox. XS may have customizations for SFS, but this doc makes no declarations that SFS is limited to consoles.


How to adopt Sampler Feedback for Streaming
To adopt SFS, an application does the following:

  • Use a tiled texture (instead of a non-tiled texture), called a reserved texture resource in D3D12, for anything that needs to be streamed.
  • Along with each tiled texture, create a small “MinMip map” texture and small “feedback map” texture.
    • The MinMip map represents per-region mip level clamping values for the tiled texture; it represents what is actually loaded.
    • The feedback map represents the per-region desired mip level for the tiled texture; it represents what needs to be loaded.
  • Update the mip streaming engine to stream individual tiles instead of mips, using the feedback map contents to drive streaming decisions.
  • When tiles are made resident or nonresident by the streaming system, the corresponding texture’s MinMip map must be updated to reflect the updated tile residency, which will clamp the GPU’s accesses to that region of the texture.
  • Change shader code to read from MinMip maps and write to feedback maps. Feedback maps are written using special-purpose HLSL constructs.
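
FWIW, the "tiles are made resident or nonresident" step maps onto a real D3D12 call, UpdateTileMappings on the command queue. A rough sketch for a single tile (error handling omitted, and the MinMip-map update is engine-specific, so it's only a comment):

```cpp
#include <d3d12.h>

// Map one 64KB tile of a reserved (tiled) texture into heap memory once the
// streaming system has loaded it. Rough sketch only.
void mapResidentTile(ID3D12CommandQueue* queue, ID3D12Resource* tiledTexture,
                     ID3D12Heap* heap, UINT tileX, UINT tileY, UINT mip,
                     UINT heapTileOffset)
{
    D3D12_TILED_RESOURCE_COORDINATE coord = {};
    coord.X = tileX;
    coord.Y = tileY;
    coord.Subresource = mip;

    D3D12_TILE_REGION_SIZE region = {};
    region.NumTiles = 1; // a single tile; UseBox=false means a linear run

    D3D12_TILE_RANGE_FLAGS rangeFlag = D3D12_TILE_RANGE_FLAG_NONE;
    UINT rangeTileCount = 1;

    queue->UpdateTileMappings(tiledTexture,
                              1, &coord, &region,       // which texture region
                              heap,
                              1, &rangeFlag,            // map (not skip/null) the range
                              &heapTileOffset, &rangeTileCount,
                              D3D12_TILE_MAPPING_FLAG_NONE);

    // After this, write the new clamp value into the texture's MinMip map so
    // shaders are allowed to sample the newly resident region.
}
```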
 
Read the MS doc I linked. There is no mention of the Xbox. XS may have customizations for SFS, but this doc makes no declarations that SFS is limited to consoles.


You're probably right; I remember the whole conversation about what is and isn't custom.
The only thing I know for sure is the custom filter.
MS would sometimes muddle VA, SFS, etc. together when talking about customizations and improvements.

I'm still not clear what they actually meant when they said they have the full RDNA2, but then I've not been following anything for a couple of months now.

Do we know which games/engines are using tiled resources? Those would be the candidates to be updated to the SFS part of VA sooner rather than later.
 
Read the MS doc I linked. There is no mention of the Xbox. XS may have customizations for SFS, but this doc makes no declarations that SFS is limited to consoles.


There is a custom filter for XSX with respect to SFS, however.

The hardware will pull the tile from the SSD as required, but it will also generate a matching, approximated coloured tile to insert into the frame in case the texture doesn't arrive in time, to avoid any sort of hiccuping, etc. The next frame, the tile will be there.
 