Velocity Architecture - Limited only by asset install sizes

On the Series X the CPU and GPU can reference data from anywhere,
Is this confirmed? Just reading through the docs, I wasn't sure about this one (I originally assumed the CPU was allowed to access whatever it wanted as well).

I'll have to check Hot Chips, but I think that's an assumption we have traditionally made; I haven't found hard proof of it. It's open to interpretation until then.

edit: wait, yeah, you're right. The CPU is in charge of I/O, so it's making the call to move data into memory. It has full access to both pools.

You are right that the challenge is probably just maximizing the memory; you are likely to have a much harder time filling it to the full allocation now that it's divided into two pools. Like packing a single moving truck vs. two smaller trucks. Tougher with big furniture, etc.

As for variable clocks, I think the reason XSX is fixed is because of the way it runs more than one virtualized environment on Azure. It can run four Xbox One instances at the same time, for instance. Variable clocks would just mess that up entirely.
 
On the Series X the CPU and GPU can reference data from anywhere; it's just that the maximum speed at which the data can be referenced differs (560 GB/s vs 336 GB/s). The CPU has an upper bound far lower than the slower memory pool anyway.
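For reference, a rough back-of-the-envelope for where those numbers come from (assuming the publicly stated 14 Gbps GDDR6 on a 320-bit bus): each 32-bit chip delivers 14 Gbps × 32 / 8 = 56 GB/s. The first 10 GB interleaves across all ten chips, so 10 × 56 = 560 GB/s; the upper 6 GB only spans the six 2 GB chips, so 6 × 56 = 336 GB/s.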

But the devs need to be aware of where the memory they're using is if it's frequently used by the GPU.

It's an entirely different story on PC, with physically separate memory pools. Though some of that is slightly mitigated by Resizable BAR implementations, it's still slower than if the memory were directly accessible by both CPU and GPU.

I think what's being referred to isn't even really related to bandwidth, but to how the data for the GPU pool of the memory actually gets into that pool. Since the CPU only has access to 6 GB from 6 of the 10 chips (and of that, 2.5 GB is reserved for the OS anyway), I've seen that as basically a 3.5 GB physical space of RAM to shuffle data for the GPU memory through, but realistically it would be even "smaller" because some of that 3.5 GB is also going to be used for CPU code and audio data. At least, if the way I'm seeing it is close to accurate.

Devs are likely aware of where the data is in memory once it's populated; it's more a question of how quickly that data can actually get to where it needs to be between the two pools. Once the GPU data is in the GPU memory pool it's basically smooth sailing, but the differing pool sizes & bandwidths might be creating some growing pains for devs in actually getting GPU-bound data into the GPU part of the memory pool. All that aside, compared to PC an APU design like Series X still has a clear advantage over what PC does with stuff like BAR or shuffling data over a PCIe bus, since it's still essentially a hUMA design in other respects. This partitioning of memory into fast & slow blocks, which also affects the physical capacity of both, feels like it kind of virtualizes some NUMA quirks into the package though, at least IMO.

edit: ignore this reply lol btw. it's completely based on wrong information.

Right, memory that is allocated to the GPU may actually need to be allocated to the CPU, but because of the split pool, developers can't use it and need workarounds to reallocate memory to it. The way memory is mapped, developers can't just freely use what's available without planning or designing around it. Instead of filling a single bucket, they now have two separate buckets to play with. Most of the earlier discussion around the disadvantages of the split pool tended to focus on the 'average' bandwidth between the two pools, but size considerations were never really discussed (and honestly, without access to documentation, I wouldn't have suspected this either).

If the GPU could directly access storage (in practice), would that resolve a lot of this? That's what DirectStorage is supposed to help with: directly accessing data in storage to populate VRAM, bypassing system RAM and the copy process. So we know the Series X is capable of this. However, it's also part of the Velocity Architecture, and DirectStorage won't start deploying until later this year. If the VA timescale is what you're speculating, then this feature probably won't be leveraged for a while even if the hardware is capable of it.

Which might be a bit of an issue going forward until it can actually be leveraged, but I guess we'll have to wait and see.
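For what it's worth, here's a conceptual sketch of the difference that bypass makes; every type and function name below is a hypothetical stand-in to show the data flow, not the real DirectStorage API:

```cpp
#include <cstdlib>

// Hypothetical stand-ins so the sketch compiles; none of this is the real API.
struct File {};
struct GpuBuffer {};

void* ReadIntoSystemRam(File)        { return std::malloc(1); } // disk -> system RAM
void  UploadToVram(void*, GpuBuffer) {}                         // system RAM -> VRAM copy
void  FreeSystemRam(void* p)         { std::free(p); }

struct StorageQueue {
    void EnqueueRead(File, GpuBuffer) {} // disk -> VRAM directly
    void Submit() {}
};

// Traditional path: the asset bounces through system RAM, costing an extra
// copy and CPU time per load.
void LoadTraditional(File f, GpuBuffer dst) {
    void* staging = ReadIntoSystemRam(f);
    UploadToVram(staging, dst);
    FreeSystemRam(staging);
}

// DirectStorage-style path: the request targets VRAM directly, so the
// system-RAM bounce disappears and the CPU only submits the request.
void LoadDirect(StorageQueue& q, File f, GpuBuffer dst) {
    q.EnqueueRead(f, dst);
    q.Submit();
}
```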

The issue is likely going to crop up the most during the transition period of games. These are games that still traditionally use the CPU for a lot of GPU functions, like culling, animation, etc. So if the CPU needs to access these memory locations, this information needs to sit in the slow pool for it to do its updates. I think the traditional thought process here is that the GPU needs a huge amount of memory, which it may in the future, but at this moment with cross-gen you may not see so much budget placed towards super-high-quality assets, so the amount of VRAM required by the GPU may be lower, like 7-8 GB, and the CPU may use the rest. But with Series X|S you are locked into how much you have in both areas, so you need careful planning on how to do it, which is difficult when you also need to make considerations for the last generation.
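As a rough worked example of that squeeze, assuming the 13.5 GB game-visible split (10 GB fast + 3.5 GB slow): a cross-gen game that wants ~7.5 GB of GPU assets but ~5 GB of CPU-side data overflows the 3.5 GB slow pool by about 1.5 GB, and that spillover has to live in the fast pool, eating into the capacity you'd rather reserve for GPU data.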

Sounds about right; it also comes with some inconvenient timing for MS, though, I would guess. Again, I think this is a reason they scaled back so much on their own 1P cross-gen support; if some of the features critical to VA can't really be leveraged by games still coded to 8th-gen consoles as their base, then the only way to break out of that early is to cut off 8th-gen support. Which, for Microsoft in particular, I always felt was the better option for a lot of reasons.

But then they also have to consider their Xbox/PC/mobile (through streaming) cross-platform initiative, and that acts as a bit of an anchor here, because solutions like DirectStorage won't even be supported on most PCs except those with virtually-impossible-to-buy RDNA2 and RTX 30 series GPUs, which won't make up a significant chunk of that market gaming-wise for a very long time. So while they could push their 1P ahead to focus on the 9th gen and accelerate use of VA and its features if they'd like or need to, that still probably creates some lag for the PC side of things.

Although I don't want to sound like I'm putting all of it on VA myself; it seems that utilization of VA potentially isn't at the heart of things regarding certain performance metrics in various 3P games on the system at this moment.

This may explain why there are random major drops on XSX with some of these launch titles. They simply ran out of room on the CPU or GPU side and needed to perform some terrible workaround to get it all to fit, i.e., relying on the SSD to stream level/animation/vertex data in for new monsters etc. while slowly unloading parts of the level.

That being said, however, the most critical features for GPU dispatch are included in the older generations of hardware (at least it's confirmed for Xbox One, and sort of assumed for PS4), so it's really about rewriting their rendering pipelines, as PC is holding them back in this regard.

Yeah, and if it's a game that also needs to work on 8th-gen platforms, any sort of SSD-level streaming of data (particularly non-texture data) needs a mirrored equivalent for the older systems, with their much more limited HDD I/O. I know games like Until Dawn got a recent update that significantly cut loading times on PS4, but loading data isn't the same thing as streaming it, and this highlights that.

And as you also say, PC would be a limiting factor as well, partly because stuff like DirectStorage and GPUDirect Storage is either too niche to really program around or simply isn't even available to use yet.

In the end it's only speculation, but these are just my thoughts on the performance of Series X|S so far. If (or rather, when) games get to GPU-driven pipelines, then animation, vertex work, culling, ordering, etc. can all be performed by the GPU, improving bandwidth by moving that particular data to GPU-optimal memory, removing it from the standard memory pool, and freeing the CPU up to do other things: hold higher framerates, work on AI, or process other tasks.

It's not the Velocity Architecture that needs to be adopted, really. I mean, that's one way to attack the issue, but that's a texturing solution. What about mesh information? Animation information? What needs to be addressed is the move to GPU-driven rendering pipelines.

So it basically comes down to cross-gen software relying on the CPU for pipelines related to rendering setup (is this another way of phrasing draw lists/draw-call instructions? XBO had support for ExecuteIndirect, which might be some of the GPU-oriented task support in 8th-gen systems you were suggesting), and in that scenario a fully unified pool is always going to win out.
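For anyone unfamiliar, here's a minimal sketch of what the ExecuteIndirect path looks like on the D3D12 side; buffer/pipeline creation and the culling compute pass are omitted, and `g_device`, `cmdList`, `argBuffer`, and `countBuffer` are placeholders for objects a real renderer would own:

```cpp
#include <d3d12.h>

// Placeholders for objects created elsewhere in a real renderer.
extern ID3D12Device*              g_device;
extern ID3D12GraphicsCommandList* cmdList;
extern ID3D12Resource*            argBuffer;   // GPU-written draw arguments
extern ID3D12Resource*            countBuffer; // GPU-written visible-draw count
constexpr UINT kMaxDraws = 4096;

void IssueIndirectDraws()
{
    // Each record in argBuffer is one indexed draw.
    D3D12_INDIRECT_ARGUMENT_DESC arg = {};
    arg.Type = D3D12_INDIRECT_ARGUMENT_TYPE_DRAW_INDEXED;

    D3D12_COMMAND_SIGNATURE_DESC sigDesc = {};
    sigDesc.ByteStride       = sizeof(D3D12_DRAW_INDEXED_ARGUMENTS);
    sigDesc.NumArgumentDescs = 1;
    sigDesc.pArgumentDescs   = &arg;

    ID3D12CommandSignature* cmdSig = nullptr;
    g_device->CreateCommandSignature(&sigDesc, nullptr, IID_PPV_ARGS(&cmdSig));

    // A compute pass has already culled objects on the GPU and written the
    // surviving D3D12_DRAW_INDEXED_ARGUMENTS records plus a visible count,
    // so the CPU never walks per-object lists.
    cmdList->ExecuteIndirect(
        cmdSig,          // layout of each argument record
        kMaxDraws,       // upper bound on draw count
        argBuffer, 0,    // GPU-written draw arguments
        countBuffer, 0); // GPU-written actual count
}
```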

With something like Series X there's just a large chunk of physical memory exclusively set aside for a processor component that these cross-gen games aren't even designed to leverage for that type of task work. And given that the CPU speeds of Sony's and MS's systems are virtually identical in SMT mode, that's going to affect MS more, since the smaller physical RAM amount (and therefore bandwidth) dedicated to the CPU in their setup is magnified, with no additional CPU clockspeed to power through and make up the difference (again, in SMT).

Some or all of this also applies to BC games, I figure, but in that case games can rely on the non-SMT clock for the CPU, those games use less physical memory, XBO/One X games don't run as fast on the CPU side in the first place, and there are probably several other things whose intricate details I don't understand well enough to talk about.
 
This probably is using the Velocity Architecture, given how close they can zoom in.
Pretty insane materials/budget setup. I guess they can keep things cheap if they keep the entire game to that condo.
 
I think what's being referred to isn't even really related to bandwidth, but to how the data for the GPU pool of the memory actually gets into that pool. Since the CPU only has access to 6 GB from 6 of the 10 chips (and of that, 2.5 GB is reserved for the OS anyway), I've seen that as basically a 3.5 GB physical space of RAM to shuffle data for the GPU memory through, but realistically it would be even "smaller" because some of that 3.5 GB is also going to be used for CPU code and audio data. At least, if the way I'm seeing it is close to accurate.

Wrong. The CPU has access to ALL memory, all 16 GB. Same with GPU, it has access to ALL memory, all 16 GB. A game has access to 13.5 GB that it can use on Series X. The GPU and CPU can access all 13.5 GB.
 
Wrong. The CPU has access to ALL memory, all 16 GB. Same with GPU, it has access to ALL memory, all 16 GB. A game has access to 13.5 GB that it can use on Series X. The GPU and CPU can access all 13.5 GB.

But that's access as in copying data to/from/within the RAM, right? Not access as in the CPU working with data across the full 16 GB; otherwise, why designate a CPU/audio-optimized pool and a GPU-optimized memory pool?

Because if it's the former, there's no disagreement or point of curiosity there. If it's the latter, though, at least IMO it opens up some discussion: if cross-gen games are not optimizing allocation of GPU-bound data to fit in the GPU-optimized 10 GB pool, and the games are still doing certain tasks that haven't been transitioned to a GPU-friendly pipeline, and those games are just barely using parts of VA to "make up" for that, it could explain pretty well some of the discrepancies in 3P cross-gen performance on the platform in these early months.
 
But that's access as in copying data to/from/within the RAM, right? Not access as in the CPU working with data across the full 16 GB; otherwise, why designate a CPU/audio-optimized pool and a GPU-optimized memory pool?

It's a single memory pool. It's full access, to do with whatever you please. Only the speed of the memory is different.

It would be suboptimal to place data in the faster speed section if it's only ever accessed by the CPU.
It would be suboptimal to place data in the slower speed section if it's only ever accessed by the GPU.

Not everything will be that clear and easily classified. It's a bit of a balancing act deciding where to place data that's used by both the CPU and GPU. It all depends on how the data is used, as in how often the GPU and CPU touch it.
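To make that balancing act concrete, here's a toy sketch of the kind of placement policy being described; AllocFast/AllocSlow are hypothetical stand-ins stubbed with malloc so it compiles, not real GDK calls:

```cpp
#include <cstdlib>

// Hypothetical pool allocators standing in for whatever the platform exposes;
// stubbed with malloc so the sketch compiles.
void* AllocFast(size_t bytes) { return std::malloc(bytes); } // 560 GB/s range
void* AllocSlow(size_t bytes) { return std::malloc(bytes); } // 336 GB/s range

enum class Accessor { CpuMostly, GpuMostly, Shared };

// Pick a pool by who touches the data most; for shared data, fall back to
// an estimate of how often the GPU reads it relative to the CPU.
void* Allocate(size_t bytes, Accessor who, float gpuTouchRatio = 0.0f)
{
    switch (who) {
    case Accessor::CpuMostly: return AllocSlow(bytes); // CPU caps out below 336 GB/s anyway
    case Accessor::GpuMostly: return AllocFast(bytes); // bandwidth-hungry GPU data
    case Accessor::Shared:
        // The balancing act: favour the fast range once GPU accesses dominate.
        return (gpuTouchRatio > 0.5f) ? AllocFast(bytes) : AllocSlow(bytes);
    }
    return nullptr;
}
```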
 
As for how Xbox One X games that use more than 10 GB get mapped to memory on Xbox Series X, that's a more interesting question. A One X Enhanced game would have access to 12 GB.

I suspect (I would need to go through Hot Chips and the release notes again) that the program executable is placed in the slower pool (up to 3.5 GB for games) and then everything else is mapped to the 10 GB faster pool. I don't know if there are any limits on how large an "executable" can be, but I doubt any would be over 3.5 GB. Even if one is, I suspect you simply map the first 3.5 GB to slower memory and anything over that to the remaining 10 GB of faster memory.
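As a quick capacity sanity check on that scheme: even a full 12 GB One X Enhanced allocation fits under the 13.5 GB game-visible total, since 3.5 GB (slow) + 10 GB (fast) leaves 1.5 GB of headroom however the split falls.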
 
Is it possible for devs to treat the entirety of the RAM as slow RAM? If they were struggling with allocating things between the fast and slow pools, they might just treat the whole thing as a unified pool. That might be why there have been some performance issues in some multi-platform games; the devs may not have had enough time to get familiar with it the first go-round.
 
They have operations that can migrate buffers between the pools, but one operation does not function as desired, and the performance of the workaround in the GDK is slower than desired in certain situations. These are some of the issues listed in the leaked XDK release notes from 2020.

Remapping memory reservations from fast to slow memory fails when using XMemVirtualAlloc
  • Using XMemVirtualAlloc fails when trying to remap a memory reservation from fast memory to slow memory. You can work around this limitation by using XMemAllocatePhysicalPages or by using a completely different reservation for slow memory
XMemAllocatePhysicalPages performs significantly slower on Anaconda with the Advanced memory configuration than Standard memory configuration. This will be addressed in a future GDK.
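Reading between the lines, the "completely different reservation" workaround would look roughly like this sketch; ReserveFast/ReserveSlow/Release are hypothetical stand-ins stubbed with malloc/free, not the actual GDK entry points:

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical reservation helpers standing in for the real GDK calls,
// stubbed so the sketch compiles.
void* ReserveFast(size_t bytes) { return std::malloc(bytes); } // fast-pool pages
void* ReserveSlow(size_t bytes) { return std::malloc(bytes); } // slow-pool pages
void  Release(void* p)          { std::free(p); }

// Instead of remapping one reservation from fast to slow memory in place
// (the path the notes say fails via XMemVirtualAlloc), keep two independent
// reservations and copy the live buffer across.
void* MigrateFastToSlow(void* fastBuf, size_t bytes)
{
    void* slowBuf = ReserveSlow(bytes);
    std::memcpy(slowBuf, fastBuf, bytes); // pay the copy once
    Release(fastBuf);                     // return the fast pages to the GPU budget
    return slowBuf;
}
```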
 
They have operations that can migrate buffers between the pools, but one operation does not function as desired, and the performance of the workaround in the GDK is slower than desired in certain situations. These are some of the issues listed in the leaked XDK release notes from 2020.

Remapping memory reservations from fast to slow memory fails when using XMemVirtualAlloc
  • Using XMemVirtualAlloc fails when trying to remap a memory reservation from fast memory to slow memory. You can work around this limitation by using XMemAllocatePhysicalPages or by using a completely different reservation for slow memory
XMemAllocatePhysicalPages performs significantly slower on Anaconda with the Advanced memory configuration than Standard memory configuration. This will be addressed in a future GDK.

I haven't been able to read everything you've posted thoroughly, but it makes the most sense. In essence it's unified memory to the CPU/GPU, but with slower and faster access to certain physical RAM. So XMemVirtualAlloc is a library function that remaps virtual RAM addresses of objects or primitives to faster or slower addresses, and XMemAllocatePhysicalPages maps the actual physical locations in memory? So is it a case of developers not using these functions properly, in addition to inefficiencies in their implementations? Or is it simply the inefficiencies in the implementations that are causing issues, and once they're sorted by MSFT devs the game code will function much better?
 
Is it possible for devs to treat the entirety of the RAM as slow RAM? If they were struggling with allocating things between the fast and slow pools, they might just treat the whole thing as a unified pool. That might be why there have been some performance issues in some multi-platform games; the devs may not have had enough time to get familiar with it the first go-round.

I just think that would bottleneck the whole system. The memory controller is built to handle data using the split memory pools; better to fully utilize that. But again, if your data ends up in the slower memory pool, you're not going to get the full performance. So it's a catch-22 until they get their software in order.
 
As for how Xbox One X games that use more than 10 GB get mapped to memory on Xbox Series X, that's a more interesting question. A One X Enhanced game would have access to 12 GB.

I suspect (I would need to go through Hot Chips and the release notes again) that the program executable is placed in the slower pool (up to 3.5 GB for games) and then everything else is mapped to the 10 GB faster pool. I don't know if there are any limits on how large an "executable" can be, but I doubt any would be over 3.5 GB. Even if one is, I suspect you simply map the first 3.5 GB to slower memory and anything over that to the remaining 10 GB of faster memory.

That's probably what they're doing, given that even though the One X has 12 GB, not all of that is used for the game.

In fact, I think only 9 GB is used for games, as the other 3 GB is used for the OS (even if they bumped it to 9.5 or 10 GB for games, that would still fit in the 560 GB/s memory pool), so the entirety of the game's allocation can go in the GPU-optimized pool and stay there.
 
That's probably what they're doing, given that even though the One X has 12 GB, not all of that is used for the game.

In fact, I think only 9 GB is used for games, as the other 3 GB is used for the OS (even if they bumped it to 9.5 or 10 GB for games, that would still fit in the 560 GB/s memory pool), so the entirety of the game's allocation can go in the GPU-optimized pool and stay there.

The game code will most likely be in the slower pool of RAM (it should be in static memory), and objects created at runtime should be stored, for the most part, in the game-optimal RAM (10 GB): textures, geometry, whatever.
 
As for how Xbox One X games that use more than 10 GB get mapped to memory on Xbox Series X, that's a more interesting question. A One X Enhanced game would have access to 12 GB.

I suspect (I would need to go through Hot Chips and the release notes again) that the program executable is placed in the slower pool (up to 3.5 GB for games) and then everything else is mapped to the 10 GB faster pool. I don't know if there are any limits on how large an "executable" can be, but I doubt any would be over 3.5 GB. Even if one is, I suspect you simply map the first 3.5 GB to slower memory and anything over that to the remaining 10 GB of faster memory.

Considering the slower memory pool still has higher memory bandwidth than the One X, I don't think they need to update the game code to fully utilize the faster and slower pools of RAM. The games could simply run on the Series X and get much better performance.
 
On the Series X the CPU and GPU can reference data from anywhere; it's just that the maximum speed at which the data can be referenced differs (560 GB/s vs 336 GB/s). The CPU has an upper bound far lower than the slower memory pool anyway.

Something is being lost in translation. The big advantage of modern console architectures is that the CPU and GPU are on the same die and both have access to the unified pool of RAM. I can't think of any compelling reason to implement some arbitrary 'allocation' of RAM ranges to the CPU or GPU.
 
Something is being lost in translation. The big advantage of modern console architectures is that the CPU and GPU are on the same die and both have access to the unified pool of RAM. I can't think of any compelling reason to implement some arbitrary 'allocation' of RAM ranges to the CPU or GPU.
The ranges are associated with the speed; they're just called GPU and CPU because those are their main uses.
 
Something is being lost in translation. The big advantage of modern console architectures is that the CPU and GPU are on the same die and both have access to the unified pool of RAM. I can't think of any compelling reason to implement some arbitrary 'allocation' of RAM ranges to the CPU or GPU.
Well, memory fragmentation would be one. GPU data is mostly really short-lived (at least going forward), so that part has a higher chance of creating many fragments. If short-lived GPU allocations share a range with long-lived CPU data, for example, freeing them leaves holes that large allocations can no longer fit into.
 
Well, memory fragmentation would be one. GPU data is mostly really short-lived (at least going forward), so that part has a higher chance of creating many fragments.
I think we're talking about different things. The way I read iroboto's post ("Right, memory that is allocated to the GPU may actually need to be allocated to the CPU, but because of the split pool, developers can't use it and need workarounds to reallocate memory to it.") was that the Xbox has formal memory-addressing arbitration. That seems unlikely and would be contrary to performance needs, which is why I think something has been lost in translation.
 