Velocity Architecture - Limited only by asset install sizes

I don't. Whether 70ms or 100ms or even 30ms, a direct fetch is too slow to be immediately addressable. You'll still be working with prefetching. The difference will be prefetching two frames in advance or six.

Sure, but you don't need some virtualised RAM to do that.

That's the thing. Your texture page does not need to be immediately addressable. The latency requirement is that it should be available in at most a couple of frametimes.
Also, though you will again disagree, the goal of sampler feedback is to reduce prefetching to a minimum. You don't need to guess what will become viewable two frames in advance and try to prefetch it: with sampler feedback you know what is in the current view frustum and can work from there.
 
Also, though you will again disagree, the goal of sampler feedback is to reduce prefetching to a minimum.
I don't disagree. ;) The only thing I've said against SF is it'll never be able to load textures on demand from a GPU request. It'll still be prefetching, but done more accurately than without it - a more efficient version of texture streaming. I have no idea what the real world gains will be, although I struggle to envision a situation where they are significant. SFS may provide a nice quality boost by softening transitions.

As for the need to have your texture available in a couple of frame times, the difference between 70 ms and 100 ms is a few frames; it's not a huge deal. If you are designing a system that'll cope okay with 100 ms latency, a system that shaves 30 ms off that isn't going to enable anything different; it'll just produce a quality difference with a little less pop-in. It won't be a game changer as function theorises. SSDs are already the next game changer. We'd need another order of magnitude or two of latency reduction over the next-gen SSDs to get the next game changer between DRAM and storage. Probably not even that would be enough (it needs to be way lower latency). The next step up will require storage access so fast it could be accessed on demand mid frame, which is DRAM/Optane level fast. Anything slower than that and you are still working with prefetching. It might mean less of it, but it's the same solution, the same 'game'.
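To put rough numbers on the 'two frames or six' point, here's a quick back-of-the-envelope sketch (figures purely illustrative, assuming a ~16.7 ms frame at 60 fps):

```cpp
// Rough arithmetic only: how many frames ahead must a prefetch be issued,
// given an end-to-end fetch latency and a target frame time?
#include <cmath>
#include <cstdio>

int framesOfLead(double fetchLatencyMs, double frameTimeMs) {
    return static_cast<int>(std::ceil(fetchLatencyMs / frameTimeMs));
}

int main() {
    const double frame60 = 1000.0 / 60.0; // ~16.7 ms per frame at 60 fps
    std::printf("100 ms latency -> prefetch %d frames ahead\n", framesOfLead(100.0, frame60)); // 6
    std::printf(" 70 ms latency -> prefetch %d frames ahead\n", framesOfLead(70.0, frame60));  // 5
    std::printf(" 30 ms latency -> prefetch %d frames ahead\n", framesOfLead(30.0, frame60));  // 2
}
```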
 
I don't. Whether 70ms or 100ms or even 30ms, a direct fetch is too slow to be immediately addressable. You'll still be working with prefetching. The difference will be prefetching two frames in advance or six.

Effective latency doesn't just come from getting data from the SSD, it also includes whatever processing needs to be done to have the data initialised / formatted / loaded into the program memory space, with a pointer to the heap, with a reference in the page table, etc. On PC, pulling something already set up in virtual memory is, afaik, much faster than trying to load data in from the normal filesystem.
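As a way of picturing 'already set up in virtual memory', here's a minimal POSIX-only sketch (the file name assets.pak is made up): mapping the file just populates page-table entries, and the actual I/O cost is only paid when a page is first touched.

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fd = open("assets.pak", O_RDONLY);   // hypothetical asset file
    if (fd < 0) return 1;

    struct stat st{};
    fstat(fd, &st);

    // Map the file: no copy happens yet, the page table just gains entries.
    void* base = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (base == MAP_FAILED) { close(fd); return 1; }

    // First touch of a page is where the actual I/O cost is paid (a page fault
    // pulls that page in from storage, or from the page cache if it's warm).
    unsigned char firstByte = static_cast<const unsigned char*>(base)[0];
    std::printf("first byte: %u\n", firstByte);

    munmap(base, st.st_size);
    close(fd);
}
```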

And in terms of latency, as long as the page table has a reference to it, and you know it'll be coming, it becomes a question of whether you can wait for it or not. And for a number of things I have to say that I think you could, if you build around it. Some possible examples:

- If you realise you need a new texture page in memory at the start of a frame (sudden camera move), and you have lots of other work you can get on with in the meantime, you could trigger the page swap and hope it would be there. If it's not, fine, you go with whatever you have, but if latency is lower you have a greater chance of using whatever you're after that frame.

- If you make an AI state change: AI state changes are latency tolerant enough that you can load the new state behaviour in on the fly and begin using it when it's in, possibly within the same frame that the change happens. You wouldn't necessarily need to prefetch it - you simply let the system page in the relevant data when the decision about the state change is made. Flag for the game to get back to this AI subroutine in, say, 500 microseconds (about 3% of a 60 fps frame), and free up resources for other processing work in the meantime (a rough sketch of this kick-off-and-check-back pattern is further below).

- Animation. Same as above really. Characters typically have lots of different animation routines, and you normally move / blend between them based on a decision making process. Animation is decoupled from frame rate these days. Need to start blending in a new animation? Don't prefetch it, just pull it in on demand.

Audio? Just pull in new music, dialog and sound effects on the fly even if you're triggering scores of different ones at the same time (audio was one of Cerny's GDC examples for their six priority level SSD access, so he's definitely thinking along these lines too).

There are lots of places where you might not need to prefetch and where lower latency would be a boon! We're going to start seeing on demand, sub-frame time accesses to data on the SSDs. The Dirt 5 technical director has already explicitly stated that XSX is very capable of this. I think PS5 will be too, though possibly with different system level processes for handling this.
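A minimal sketch of the kick-off-and-check-back pattern from the AI / animation examples above. The names (AiState, loadAiState) are made up, and a real engine would use its own job and IO systems rather than std::async; this is just to show the shape of it:

```cpp
#include <chrono>
#include <cstddef>
#include <future>
#include <string>
#include <thread>
#include <vector>

struct AiState { std::vector<std::byte> behaviourData; };

// Stand-in for an SSD read + decompress of one behaviour blob.
AiState loadAiState(const std::string& /*name*/) {
    std::this_thread::sleep_for(std::chrono::microseconds(500)); // pretend I/O latency
    return AiState{std::vector<std::byte>(16 * 1024)};
}

int main() {
    // Decision made mid-frame: kick off the load, don't wait on it.
    std::future<AiState> pending = std::async(std::launch::async, loadAiState, "guard_alert");

    // ... get on with other frame work here ...

    // Later in the frame (or next frame), see whether it has arrived.
    if (pending.wait_for(std::chrono::microseconds(0)) == std::future_status::ready) {
        AiState state = pending.get();
        // switch the agent over to the new behaviour
        (void)state;
    } else {
        // not there yet: keep running the old behaviour and poll again next frame
    }
}
```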

Sure, but you don't need some virtualised RAM to do that.

You don't need it, but it could make the process of pulling data into main memory faster and lower overhead than current console and PC methods. In fact, I'm inclined to think it definitely can make it faster in some circumstances - the PC pagefile is a good example of this. And a similar system with a high level of compression of the virtualised RAM on SSD (compressed at build time), with a very fast and low latency decompression / IO block, would minimise the transfer requirements.

Virtualised memory also allows for an increased degree of abstraction, not just for one particular system but for a generation of consoles (XSX and Lockhart) with very different memory profiles.

I think the key to managing Lockhart downgrades gracefully and time-effectively will be investing time in using MS's "Velocity Architecture" well. Knowing what you can leave to the virtual memory system and what you need to manage explicitly will be very important.

Yup. Assuming Microsoft does have a latency advantage, I think the relative difference in latency will be far less pronounced than the difference in raw bandwidth. But if you're talking 2.5 seconds to load to the point where you're ready to play vs 4.8 seconds, it's still a huge advance over what console and PC gamers experience now.

I mean yeah it's definitely hard to see how the PC can mimic next gen consoles. Though gobs of DDR4 should be able to make up for most things ...

My thoughts about XSX and Velocity Architecture aren't really in opposition to the PS5 SSD btw. I do think MS are targeting latency, but I doubt the PS5 is going to be a slouch. Some of what I've supposed might be more difficult for them as they have to support off the shelf SSDs, but the six priority levels of SSD access seem to be about ... well ... managing latency for latency critical accesses.

I also wouldn't be surprised to see a similar option for pulling data from SSD "ready to go" like the one I think MS are going for with the "virtual memory".

Right across next gen MS and Sony might have slightly different approaches, but their goals seem to be very similar.
 
Right across next gen MS and Sony might have slightly different approaches, but their goals seem to be very similar.

Yup, the two console architectures are more alike than different. Microsoft put more into the GPU, Sony more into the SSD, audio system and batshit-crazy design. But they're closer in capability/performance than any two previous competing consoles I can think of.
 
Effective latency doesn't just come from getting data from the SSD, it also includes whatever processing needs to be done to have the data initialised / formatted / loaded into the program memory space, with a pointer to the heap, with a reference in the page table, etc. On PC, pulling something already set up in virtual memory is, afaik, much faster than trying to load data in from the normal filesystem.
Oh sorry, I saw ms and read it as milliseconds, despite you naming microseconds beforehand. :oops: Conditioning. We haven't ever had speeds where we're dealing with µs to date!

If dealing with sub-frame access times, yeah, it becomes usable as live data, but still, a 30% difference in lag isn't going to change what one console can do over the other. If you can delay a job 70 µs, you can delay it 100 µs. It still won't be a game changer. TBH anything 30% faster, or even 50%, or even 2x faster isn't a game changer. We tend to be looking at an order of magnitude before things can be genuinely different.

I also don't think the on-the-fly methods you suggest are particularly realistic.

Anything visual, like textures or animation, you are processing now as it's in the pipeline. The possibility of interrupting what you are currently drawing to go fetch some data and come back to it seems highly suboptimal. You want all your drawing and whatnot to be laid out ready to process, and not hop around jobs waiting on data. If something's not there this frame, fetch it this frame for next frame, with decoupled IO buffering your data without holding up any jobs.
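That decoupled approach could look something like the sketch below: draw with what's resident now, queue anything that's missing, and let the IO system fill those requests between frames. All type and function names here are made up for illustration, not from any real engine.

```cpp
#include <cstdint>
#include <queue>
#include <unordered_set>

using TextureId = std::uint32_t;

struct StreamingSystem {
    std::unordered_set<TextureId> resident;   // texture pages currently in memory
    std::queue<TextureId> requests;           // pages wanted for the next frame

    // Called from the render path: never stalls the frame being drawn.
    bool requestOrFallback(TextureId id) {
        if (resident.count(id)) return true;  // full-quality page is available
        requests.push(id);                    // ask for it for next frame
        return false;                         // caller draws a lower mip this frame
    }

    // Called by the IO system between frames.
    void resolveRequests() {
        while (!requests.empty()) {
            resident.insert(requests.front()); // stand-in for the actual SSD read
            requests.pop();
        }
    }
};

int main() {
    StreamingSystem streaming;
    bool hiRes = streaming.requestOrFallback(42);   // frame N: false -> draw a lower mip
    streaming.resolveRequests();                    // between frames: the page comes in
    hiRes = streaming.requestOrFallback(42);        // frame N+1: true -> draw full quality
    (void)hiRes;
}
```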

If what you suggest were to happen, I think we'd need some pretty different software. You'd need a sort of data management system, and jobs would request data to process, and this manager would stall them if it wasn't on hand, and you'd need much more flexible execution than we have now. "Right, draw this wall. Shit, the texture's not ready. Okay, play that sound. And that one. Bugger, that's not ready. Okay, process this AI. Audio's loaded? Right, play that sound. Now finish that AI. Okay, onto drawing this wall, and then we'll work out where that ball is bouncing..."

Having gotten all excited about DOTS and the burst compiler and the efficiency that comes with entities, this sounds like a real step backwards in performance. To maximise your processing, you want everything in cute, linear blocks of data that you can churn through. Linear data access is going to be huge in making the most of that available BW and cache efficiencies.
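For what 'linear blocks of data' means in practice, a tiny structure-of-arrays sketch (illustrative only): positions and velocities live in contiguous arrays, so an update pass streams through memory linearly instead of hopping between heap objects.

```cpp
#include <cstddef>
#include <vector>

// Positions and velocities in separate contiguous arrays (structure of arrays).
struct Particles {
    std::vector<float> posX, posY, posZ;
    std::vector<float> velX, velY, velZ;
};

void integrate(Particles& p, float dt) {
    const std::size_t n = p.posX.size();
    for (std::size_t i = 0; i < n; ++i) {   // straight, cache-friendly walk through memory
        p.posX[i] += p.velX[i] * dt;
        p.posY[i] += p.velY[i] * dt;
        p.posZ[i] += p.velZ[i] * dt;
    }
}

int main() {
    Particles p;
    p.posX.assign(1000, 0.0f); p.posY.assign(1000, 0.0f); p.posZ.assign(1000, 0.0f);
    p.velX.assign(1000, 1.0f); p.velY.assign(1000, 0.0f); p.velZ.assign(1000, 0.0f);
    integrate(p, 1.0f / 60.0f);
}
```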
 
Some choice highlighting of what they say near the end makes it seem like it's not limited to 100 GB.

Through the massive increase in I/O throughput, hardware accelerated decompression, DirectStorage, and the significant increases in efficiency provided by Sampler Feedback Streaming, the Xbox Velocity Architecture enables the Xbox Series X to deliver effective performance well beyond the raw hardware specs, providing direct, instant, low level access to more than 100GB of game data stored on the SSD just in time for when the game requires it. These innovations will unlock new gameplay experiences and a level of depth and immersion unlike anything you have previously experienced in gaming.​
 
Despite the ability for modern game engines and middleware to stream game assets into memory off of local storage, level designers are still often required to create narrow pathways, hallways, or elevators to work around the limitations of a traditional hard drive and I/O pipeline. These in-game elements are often used to mask the need to unload the prior zone’s assets from memory while loading in new assets for the next play space. As we discussed developers’ aspirations for their next generation titles and the limitations of current generation technology, this challenge would continue to increase exponentially and further constrain the ambition for truly transformative games.
The future was reams and reams of corridors! :runaway:

If our custom designed processor is at the heart of the Xbox Series X, the Xbox Velocity Architecture is the soul.
Ugh.

The Xbox Velocity Architecture comprises four major components: our custom NVME SSD, hardware accelerated decompression blocks, a brand new DirectStorage API layer and Sampler Feedback Streaming (SFS).
So nothing there about virtual memory per se.
New DirectStorage API: Standard File I/O APIs were developed more than 30 years ago and are virtually unchanged while storage technology has made significant advancements since then. As we analyzed game data access patterns as well as the latest hardware advancements with SSD technology, we knew we needed to advance the state of the art to put more control in the hands of developers. We added a brand new DirectStorage API to the DirectX family, providing developers with fine grain control of their I/O operations empowering them to establish multiple I/O queues, prioritization and minimizing I/O latency. These direct, low level access APIs ensure developers will be able to take full advantage of the raw I/O performance afforded by the hardware, resulting in virtually eliminating load times or fast travel systems that are just that . . . fast.
Through the massive increase in I/O throughput, hardware accelerated decompression, DirectStorage, and the significant increases in efficiency provided by Sampler Feedback Streaming, the Xbox Velocity Architecture enables the Xbox Series X to deliver effective performance well beyond the raw hardware specs, providing direct, instant, low level access to more than 100GB of game data stored on the SSD just in time for when the game requires it.
In summary, this 100 GB is just a marketing point for the general IO architecture.
 
Well, you provided the information here that I think was being missed dramatically. So, to put it in layman's terms:


Sampler Feedback Streaming (SFS): Sampler Feedback Streaming is a brand-new innovation built on top of all the other advancements of the Xbox Velocity Architecture. Game textures are optimized at differing levels of detail and resolution, called mipmaps, and can be used during rendering based on how close or far away an object is from the player. As an object moves closer to the player, the resolution of the texture must increase to provide the crisp detail and visuals that gamers expect. However, these larger mipmaps require a significant amount of memory compared to the lower resolution mips that can be used if the object is further away in the scene. Today, developers must load an entire mip level in memory even in cases where they may only sample a very small portion of the overall texture. Through specialized hardware added to the Xbox One X, we were able to analyze texture memory usage by the GPU and we discovered that the GPU often accesses less than 1/3 of the texture data required to be loaded in memory. A single scene often includes thousands of different textures resulting in a significant loss in effective memory and I/O bandwidth utilization due to inefficient usage. With this insight, we were able to create and add new capabilities to the Xbox Series X GPU which enables it to only load the sub portions of a mip level into memory, on demand, just in time for when the GPU requires the data. This innovation results in approximately 2.5x the effective I/O throughput and memory usage above and beyond the raw hardware capabilities on average. SFS provides an effective multiplier on available system memory and I/O bandwidth, resulting in significantly more memory and I/O throughput available to make your game richer and more immersive.

So that is what SFS does. This language is more straightforward, however.
This was explained in the SFS DirectX video, but it was still hard to understand.

For better clarification on what is being solved with respect to streaming and virtual texturing tiles: you can see how many tiles are _not_ fully visible to the player. Most of them are -partial- tiles, but you have to load the whole mip anyway.

My visual interpretation of what SFS is doing vs standard Virtual Texturing. Looking at it this way provides a lot of reason why many developers opted not to use Tiled Resources, as the hardware function was locked to 64 KB tiles IIRC. That's incredibly wasteful without SFS. A smaller tile would provide better efficiency at the cost of other things.
[Image: tiles loaded with SFS vs a standard virtual texturing / full-mip approach]
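Some back-of-the-envelope numbers for why partial residency matters, using the "less than 1/3" figure from the Microsoft quote and an illustrative 4096x4096 BC7 mip level (~16 MiB, i.e. 256 tiles of 64 KiB):

```cpp
#include <cstdio>

int main() {
    // A 4096x4096 BC7 mip level is ~16 MiB, which is 256 tiles of 64 KiB each.
    const int totalTiles   = 256;
    const int touchedTiles = totalTiles / 3;        // "often accesses less than 1/3"
    const double tileMiB   = 64.0 / 1024.0;         // one 64 KiB tile expressed in MiB

    std::printf("whole mip resident : %4.1f MiB\n", totalTiles * tileMiB);    // 16.0
    std::printf("touched tiles only : %4.1f MiB\n", touchedTiles * tileMiB);  // ~5.3
}
```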



 
With this insight, we were able to create and add new capabilities to the Xbox Series X GPU which enables it to only load the sub portions of a mip level into memory, on demand, just in time for when the GPU requires the data. This innovation results in approximately 2.5x the effective I/O throughput and memory usage above and beyond the raw hardware capabilities on average.
However, this is a marketing piece. Loading a mip level on demand, just in time, is a matter of latency, not bandwidth.

This is in a piece with the words "instant access to more than 100 GB". Instant here isn't a technical measure. I don't know that 'just in time' should be taken verbatim. SFS blends mip levels as I understand it; it doesn't add more than SF does. And sampler feedback happens during drawing, at which point it's too late to fetch a higher mip level, because you need that texture in RAM now as that texture is being drawn. Hence the desire for SFS to blend between streamed textures; that would be a redundant feature if the correct mip were always loaded.

I trust the 2.5x effective IO (for textures) from the selective loading. That's a realistic improvement. The whole '100 GB instant access' is just fluff. It's, overall, a fast SSD-driven IO system, collectively called the 'Velocity Architecture', that allows the game data ("100 GB") to be accessed freely at low latency. The 100 GB figure isn't a measure of anything specific and doesn't represent some particular 100 GB portion of VM or caching or clever paging.
 
However, this is a marketing piece. Loading a mip level on demand, just in time, is a matter of latency, not bandwidth.

This is in a piece with the words "instant access to more than 100 GB". Instant here isn't a technical measure. I don't know that 'just in time' should be taken verbatim. SFS blends mip levels as I understand it; it doesn't add more than SF does. And sampler feedback happens during drawing, at which point it's too late to fetch a higher mip level, because you need that texture in RAM now as that texture is being drawn. Hence the desire for SFS to blend between streamed textures; that would be a redundant feature if the correct mip were always loaded.

I trust the 2.5x effective IO (for textures) from the selective loading. That's a realistic improvement. The whole '100 GB instant access' is just fluff. It's, overall, a fast SSD-driven IO system, collectively called the 'Velocity Architecture', that allows the game data ("100 GB") to be accessed freely at low latency. The 100 GB figure isn't a measure of anything specific and doesn't represent some particular 100 GB portion of VM or caching or clever paging.
I updated my post with a picture; it's about taking a portion of the mip (the area sampled) and not needing to load the whole mip. With my post I just wanted to clearly showcase what the function of Sampler Feedback is, for readers who I don't think understood its purpose and why it's necessary.

It's clear Sampler Feedback has performance implications and you do not sample all the time, as noted in the DirectX talk on it. But if you can sample and load only what you need, you save on I/O and bandwidth.

You will likely take a hit on latency IMO, just looking at the rendering chain: the SFS pipeline is longer than the normal pipeline because you need to add the SFS step every once in a while. I am with you on the instant/on demand discussion piece. There is hardware in there to resolve blending if you can't get your texture on time. So there are obvious limitations to what on demand means. But since most textures start at mip 10/12/13, the system has a chance to know how quickly to sample up and what parts of that texture you need all the way up through the chain, until you change the angle at which you see the texture.
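A tiny CPU-side sketch of what that fallback-plus-blend could look like conceptually (made-up names and a made-up 4-frame fade; the real blending on Series X is described as a hardware texture filter feature):

```cpp
#include <algorithm>

// Mip convention: lower index = finer. Returns which mip to sample and a 0..1
// weight for blending it against the coarser mip that was used previously.
struct MipChoice { int mip; float blend; };

MipChoice chooseMip(int wantedMip, int finestResidentMip, float framesSinceWantedArrived) {
    if (wantedMip >= finestResidentMip) {
        // Wanted data is resident: fade it in over ~4 frames so the upgrade
        // isn't a visible pop against the coarser mip drawn last frame.
        float w = std::clamp(framesSinceWantedArrived / 4.0f, 0.0f, 1.0f);
        return {wantedMip, w};
    }
    // Not resident yet: fall back entirely to the best mip that is.
    return {finestResidentMip, 1.0f};
}

int main() {
    // Sampler wanted mip 2 but only mip 4 and coarser are resident -> clamp.
    MipChoice c = chooseMip(/*wantedMip=*/2, /*finestResidentMip=*/4, /*framesSinceWantedArrived=*/0.0f);
    (void)c; // c.mip == 4, c.blend == 1.0f
}
```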
 
I don't disagree. ;) The only thing I've said against SF is it'll never be able to load textures on demand from a GPU request. It'll still be prefetching, but done more accurately than without it - a more efficient version of texture streaming. I have no idea what the real world gains will be, although I struggle to envision a situation where they are significant. SFS may provide a nice quality boost by softening transitions.

As for the need to have your texture available in a couple of frame times, the difference between 70 ms and 100 ms is a few frames; it's not a huge deal. If you are designing a system that'll cope okay with 100 ms latency, a system that shaves 30 ms off that isn't going to enable anything different; it'll just produce a quality difference with a little less pop-in. It won't be a game changer as function theorises. SSDs are already the next game changer. We'd need another order of magnitude or two of latency reduction over the next-gen SSDs to get the next game changer between DRAM and storage. Probably not even that would be enough (it needs to be way lower latency). The next step up will require storage access so fast it could be accessed on demand mid frame, which is DRAM/Optane level fast. Anything slower than that and you are still working with prefetching. It might mean less of it, but it's the same solution, the same 'game'.

It's on-demand streaming because it is purely reactive and totally deterministic: you literally don't know what's needed until it is sampled, but once it's sampled you know exactly what to load.
Interestingly, Epic is trying to do this frustum view culling through their engine to lower throughput requirements (see the new video from the Unreal team). I think MS's hardware solution will end up leaving this type of software hack in the dust, and as DX12U-compatible GPUs become more widespread, so will SFS adoption.
 

Well, you provided the information here that I think was being missed dramatically. So, to put it in layman's terms:


Sampler Feedback Streaming (SFS): Sampler Feedback Streaming is a brand-new innovation built on top of all the other advancements of the Xbox Velocity Architecture. Game textures are optimized at differing levels of detail and resolution, called mipmaps, and can be used during rendering based on how close or far away an object is from the player. As an object moves closer to the player, the resolution of the texture must increase to provide the crisp detail and visuals that gamers expect. However, these larger mipmaps require a significant amount of memory compared to the lower resolution mips that can be used if the object is further away in the scene. Today, developers must load an entire mip level in memory even in cases where they may only sample a very small portion of the overall texture. Through specialized hardware added to the Xbox One X, we were able to analyze texture memory usage by the GPU and we discovered that the GPU often accesses less than 1/3 of the texture data required to be loaded in memory. A single scene often includes thousands of different textures resulting in a significant loss in effective memory and I/O bandwidth utilization due to inefficient usage. With this insight, we were able to create and add new capabilities to the Xbox Series X GPU which enables it to only load the sub portions of a mip level into memory, on demand, just in time for when the GPU requires the data. This innovation results in approximately 2.5x the effective I/O throughput and memory usage above and beyond the raw hardware capabilities on average. SFS provides an effective multiplier on available system memory and I/O bandwidth, resulting in significantly more memory and I/O throughput available to make your game richer and more immersive.

So that is what SFS does. This language is more straightforward, however.
This was explained in the SFS DirectX video, but it was still hard to understand.

For better clarification on what is being solved with respect to streaming and virtual texturing tiles: you can see how many tiles are _not_ fully visible to the player. Most of them are -partial- tiles, but you have to load the whole mip anyway.

My visual interpretation of what SFS is doing vs standard Virtual Texturing. Looking at it this way provides a lot of reason why many developers opted not to use Tiled Resources, as the hardware function was locked to 64 KB tiles IIRC. That's incredibly wasteful without SFS. A smaller tile would provide better efficiency at the cost of other things.
[Image: tiles loaded with SFS vs a standard virtual texturing / full-mip approach]




It's actually very clearly laid out in the SFS patent, with some frankly ridiculously detailed use cases, whereas patent legalese is usually more tediously generalised. I'm actually surprised that MS was this specific in describing the scope of the patent.
 
I think they're all the worst, but it's their job to sell a product; the ones we hate the most probably do their job the best. The average Joe is going to fall for all this SSD hype like it's the second coming. When that's exhausted they'll find something else to hype about.
The tone of his voice seems to trigger me. Nothing against the man himself. He's got a job to do after all.
 