Velocity Architecture - Limited only by asset install sizes

I think we're talking about different things. The way I read iroboto's post ("Right, memory that is allocated to the GPU may actually need to be allocated to the CPU, but because of the split pool, developers can't use it and need work arounds to reallocate memory to it.") was that the Xbox has formal memory addressing arbitration. This seems unlikely and would be contrary to performance needs. This is why I think something has been lost in translation.
That's my fault; I made the mistake that seems to be carrying through on this discussion. Brit has the right of it.
The official documentation refers to the pools as
a) Standard memory
b) GPU optimal memory

There's no official separation of CPU/GPU in any of their documentation. I got caught up in the text indicating that you cannot remap GPU optimal memory to the standard pool without a lot of effort, and in the recent patch notes indicating that the standard pool was being reduced further to give more space to the GPU optimal pool. That led me to believe that, for some reason, developers needed or requested more in the standard pool to make things work, or wanted control over the amount mapped, and so my mind went to thinking the CPU perhaps did not have access to GPU optimal memory. I strayed down the wrong path from there.
 
Either I'm missing something, or this is the perfect tech for an Ant-Man game, or both.
Typically virtual reality would want exactly these kinds of traits in graphics technology, since players can grab objects and bring them right up to their face to appreciate the finer details. But yes, Ant-Man would also work lol
 
That's my fault; I made the mistake that seems to be carrying through on this discussion. Brit has the right of it. ... There's no official separation of CPU/GPU in any of their documentation.
Cool. Although he didn't go into huge detail, Andrew Goossen did state "Memory performance is asymmetrical". I.e. it's one memory region, one bus, one path to memory, but two clear configurations of RAM within that region. I can't recall this ever having been done before, at least not in a contiguous range. The Commodore Amiga did have slow and fast RAM configurations, and the mapping was effectively contiguous, but there were completely separate buses, meaning you could access both RAM pools simultaneously at their respective full speeds. That's an expensive solution, though.
 
Cool. Although he didn't go into huge detail, Andrew Goossen did state "Memory performance is asymmetrical". I.e. it's one memory region, one bus, one path to memory, but two clear configurations of RAM within that region. I can't recall this ever having been done before, at least not in a contiguous range. The Commodore Amiga did have slow and fast RAM configurations, and the mapping was effectively contiguous, but there were completely separate buses, meaning you could access both RAM pools simultaneously at their respective full speeds. That's an expensive solution, though.
Yea, it's quite neat. It does seem to be a single bus, but virtually mapped to 2 separate pools.
Allocating memory, I guess, requires a call against one pool or the other, and I suppose there is some check to determine whether enough memory is available in that pool.
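A minimal sketch of what that per-pool bookkeeping might look like (purely hypothetical: the pool struct, sizes and allocate call are my own illustration, not the GDK API):

#include <cstddef>
#include <optional>

// Hypothetical tracker for the two game-visible pools: "GPU optimal" (fast)
// and "standard" (slow). Sizes are the publicly quoted Series X figures.
struct MemoryPool {
    const char* name;
    std::size_t capacity;   // total bytes the game may take from this pool
    std::size_t used = 0;   // bytes already handed out

    // Returns an offset into the pool, or nothing if the pool can't hold it.
    std::optional<std::size_t> allocate(std::size_t bytes) {
        if (used + bytes > capacity) return std::nullopt;  // the "checker"
        std::size_t offset = used;
        used += bytes;
        return offset;
    }
};

int main() {
    MemoryPool gpuOptimal{"GPU optimal", 10ull * 1024 * 1024 * 1024};
    MemoryPool standard  {"standard",    3584ull * 1024 * 1024};   // ~3.5 GB

    // The caller decides which pool to ask, e.g. by bandwidth sensitivity.
    auto vertexData  = gpuOptimal.allocate(512ull * 1024 * 1024);
    auto gameScripts = standard.allocate(64ull * 1024 * 1024);
    return (vertexData && gameScripts) ? 0 : 1;
}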
 
I think we're talking about different things. The way I read iroboto's post ("Right, memory that is allocated to the GPU may actually need to be allocated to the CPU, but because of the split pool, developers can't use it and need work arounds to reallocate memory to it.") was that the Xbox has formal memory addressing arbitration. This seems unlikely and would be contrary to performance needs. This is why I think something has been lost in translation.

That doesn't seem right. It's unified RAM; it's just that devs have to ensure that certain assets are mapped to the faster memory pool. Otherwise, the CPU can access any memory address that the GPU can. The only disadvantage to this whole setup is either developer tools having bugs or devs not taking the time to allocate objects to the right portions of memory. Brit gave examples of the code used to do this.
 
As for how Xbox One X games that use more than 10 GB get mapped to memory on Xbox Series X, that's a more interesting question. A One X enhanced game would have access to 12 GB.

I suspect (I would need to go through Hot Chips and the release notes again) that the program executable is placed in the slower pool (up to 3.5 GB for games) and that everything else is mapped to the 10 GB faster pool. I don't know if there are any limits on how large an "executable" can be, but I doubt any would be over 3.5 GB. Even if one were, I suspect you would simply map the first 3.5 GB to slower memory and anything over that to the remaining 10 GB of faster memory.
If I recall correctly, of the 12 GB on the X1X, 9 GB is addressable by the GAME while the remaining 3 GB is reserved for the OS, so an X1X game would map fully into the GPU optimal pool of the XSX memory setup.
 
Ah, I had a slight think-o earlier. :oops:

BC for One X enhanced games on Series X is even easier than I proposed. The One X has 12 GB total while games have access to 9 GB, so they could easily map everything to the faster memory addresses and be done with it. As others pointed out, even the slower memory addresses are faster than what's in the One X, so even if a game relied on some system buffering, it would never see worse performance on Series X.
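A trivial back-of-the-envelope check of that claim (my own illustration; the only inputs are the publicly stated 9 GB game allocation on the One X and the 10 GB GPU optimal region on the Series X):

#include <cstdint>

int main() {
    constexpr std::uint64_t GB = 1024ull * 1024 * 1024;
    constexpr std::uint64_t x1xGameMemory   = 9 * GB;    // what a One X enhanced title can see
    constexpr std::uint64_t fastRegionBytes = 10 * GB;   // Series X GPU optimal region

    // An identity mapping of the guest allocation into the fast region always
    // fits, because the guest footprint never exceeds the fast region's size.
    static_assert(x1xGameMemory <= fastRegionBytes);
    return 0;
}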
 
They have operations that can migrate buffers between the pools, but one operation does not function as desired, and the performance of the workaround in the GDK is slower than desired in certain situations. These are some of the issues listed in the leaked XDK Release Notes from 2020 (a sketch of the workaround pattern follows the excerpt):

Remapping memory reservations from fast to slow memory fails when using XMemVirtualAlloc
  • Using XMemVirtualAlloc fails when trying to remap a memory reservation from fast memory to slow memory. You can work around this limitation by using XMemAllocatePhysicalPages or by using a completely different reservation for slow memory
XMemAllocatePhysicalPages performs significantly slower on Anaconda with the Advanced memory configuration than Standard memory configuration. This will be addressed in a future GDK.
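A rough sketch of the "completely different reservation" workaround described in that note (hypothetical helper names; real code would use XMemVirtualAlloc / XMemAllocatePhysicalPages with the proper pool attributes, whose exact signatures I'm not reproducing here):

#include <cstddef>
#include <cstdlib>
#include <cstring>

// Placeholder allocators standing in for the GDK calls; illustration only.
void* alloc_fast_pool(std::size_t bytes) { return std::malloc(bytes); }
void* alloc_slow_pool(std::size_t bytes) { return std::malloc(bytes); }
void  release_pool(void* p)              { std::free(p); }

// Instead of remapping a fast-memory reservation in place (the failing path),
// create a brand-new slow-memory reservation, copy the contents, and drop the
// old reservation. Costs a copy, but avoids the broken remap.
void* move_to_slow_pool(void* fastBuffer, std::size_t bytes) {
    void* slowBuffer = alloc_slow_pool(bytes);
    if (slowBuffer == nullptr) return nullptr;   // standard pool exhausted
    std::memcpy(slowBuffer, fastBuffer, bytes);
    release_pool(fastBuffer);                    // hand the fast pages back
    return slowBuffer;
}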

I think it's easy to get confused, because AMD/MS/Sony don't provide a lot of context specifically regarding how the underlying physical memory operates when servicing the virtual memory system.

"The CPU and GPU gets a complete view of memory" is a common perception. But what memory specifically? The virtual memory address space while the actual physical memory is abstracted away and still has a bunch of limitations that existed since the first AMD APUs? Those limitations are simply mitigated by hardware instead of the application (software) now? Like there are still regions of memory that aren't readily available to the CPU, but the data copying to overcome such issues is done by the hardware behind the scenes.
 
Cool. Although he didn't go into huge detail, Andrew Goossen did state "Memory performance is asymmetrical". I.e. it's one memory region, one bus, one path to memory, but two clear configurations of RAM within that region. I can't recall this ever having been done before, at least not in a contiguous range. The Commodore Amiga did have slow and fast RAM configurations, and the mapping was effectively contiguous, but there were completely separate buses, meaning you could access both RAM pools simultaneously at their respective full speeds. That's an expensive solution, though.

For a very brief moment this is what I thought Microsoft was doing after they first gave some of the specs on the system way back, until I realized it was virtually impossible, mainly because of the speed of the modules they have, plus no GDDR6 exists with the speeds or capacities (or pin counts) needed to provide that.

I was a lot more naive about a lot of this stuff back then :S. Anyway, this is why looking at older gaming hardware architectures is so interesting: newer techniques usually have some kind of equivalent or predecessor in older hardware. The application, implementation and purpose may vary, but some basic concepts seem to stand the test of time, just appearing as different iterations as newer designs come about.
 
Cool. Although he didn't go into huge detail, Andrew Goossen did state "Memory performance is asymmetrical". I.e. it's one memory region, one bus, one path to memory, but two clear configurations of RAM within that region. I can't recall this ever having been done before, at least not in a contiguous range. The Commodore Amiga did have slow and fast RAM configurations, and the mapping was effectively contiguous, but there were completely separate buses, meaning you could access both RAM pools simultaneously at their respective full speeds. That's an expensive solution, though.
After all, it is still better than not having the extra memory, and it should still be faster than a 256-bit bus with 16 GB of memory, even when switching between the pools must be done. The biggest part of the slow pool is reserved for the OS anyway, so it shouldn't be accessed that frequently. But yes, on average the maximum memory speed won't be reached.
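For rough numbers (my own arithmetic from the publicly stated bus widths and the 14 Gbps GDDR6 data rate, not figures from this thread):

#include <cstdio>

int main() {
    // Peak bandwidth = bus width (bits) * per-pin data rate (Gbps) / 8 bits per byte.
    const double gbps = 14.0;                // GDDR6 data rate per pin
    const double fast = 320 * gbps / 8;      // 10 GB region  -> 560 GB/s
    const double slow = 192 * gbps / 8;      //  6 GB region  -> 336 GB/s
    const double flat = 256 * gbps / 8;      // hypothetical 256-bit, 16 GB -> 448 GB/s

    std::printf("fast %.0f GB/s, slow %.0f GB/s, 256-bit %.0f GB/s\n", fast, slow, flat);
    return 0;
}

How the blended average compares to a flat 448 GB/s then just depends on what fraction of the traffic lands in the slow region.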
 
You are right that the challenge is probably just maximizing the memory; you are likely to have a much harder time filling it to the full allocation now that it's divided into two pools. Like packing a single moving truck vs. two smaller trucks. Tougher with big furniture, etc.

Is it really that big a challenge? Flag data as either bandwidth sensitive or not. When placing it in RAM, fill the RAM "from the borders, inward". If the non-bandwidth-sensitive data ends up spilling into fast RAM, no harm is done. If some portion of the bandwidth-sensitive data spills into the slower portion, the game will still work; just the part that is not in the fastest RAM will perform a little worse. The speed difference is not even that huge, and it's already faster than anything on PS4/XBONE. It will be ok, guys.
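A double-ended arena captures that "borders inward" idea; here is a minimal sketch under the assumption of one contiguous range with a fast end and a slow end (names and structure are my own, not from any SDK):

#include <cstddef>
#include <optional>

// One contiguous range: [0, fastBytes) is the fast region and
// [fastBytes, fastBytes + slowBytes) the slow one. Bandwidth-sensitive data
// grows up from the fast edge, everything else grows down from the slow edge,
// and the two cursors meet wherever they meet.
class DoubleEndedArena {
public:
    DoubleEndedArena(std::size_t fastBytes, std::size_t slowBytes)
        : fastCursor_(0), slowCursor_(fastBytes + slowBytes) {}

    // Bandwidth-sensitive data: placed from the fast edge upward.
    std::optional<std::size_t> allocFast(std::size_t bytes) {
        if (fastCursor_ + bytes > slowCursor_) return std::nullopt;  // arena full
        std::size_t offset = fastCursor_;
        fastCursor_ += bytes;
        return offset;
    }

    // Everything else: placed from the slow edge downward, spilling into the
    // fast region only once the slow region is exhausted.
    std::optional<std::size_t> allocSlow(std::size_t bytes) {
        if (slowCursor_ - fastCursor_ < bytes) return std::nullopt;  // arena full
        slowCursor_ -= bytes;
        return slowCursor_;
    }

private:
    std::size_t fastCursor_;  // next free byte at the fast end
    std::size_t slowCursor_;  // one past the last free byte at the slow end
};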
 
You are right that the challenge is probably just maximizing the memory; you are likely to have a much harder time filling it to the full allocation now that it's divided into two pools. Like packing a single moving truck vs. two smaller trucks. Tougher with big furniture, etc.

I think the main issue is having data that needs the faster memory bandwidth in the wrong RAM pool. Otherwise I don't think they'd have issues filling up memory if they're using 4k textures and highly complex geometry data.
 
That doesn't seem right. It's unified RAM; it's just that devs have to ensure that certain assets are mapped to the faster memory pool. Otherwise, the CPU can access any memory address that the GPU can. The only disadvantage to this whole setup is either developer tools having bugs or devs not taking the time to allocate objects to the right portions of memory. Brit gave examples of the code used to do this.

Is it your understanding that the CPU and GPU cannot access the same memory? That seems contrary to the fundamental design of a unified memory architecture. What is it that makes you think this is the case?

Am I misunderstanding you?
 
Is it really that big a challenge? Flag data as either bandwidth sensitive or not. When placing it in RAM, fill the RAM "from the borders, inward". If the non-bandwidth-sensitive data ends up spilling into fast RAM, no harm is done. If some portion of the bandwidth-sensitive data spills into the slower portion, the game will still work; just the part that is not in the fastest RAM will perform a little worse. The speed difference is not even that huge, and it's already faster than anything on PS4/XBONE. It will be ok, guys.
I mean, I guess I'm just thinking about space fragmentation in memory: you've got 200 MB left in each pool but the next set of assets requires 250 MB, for instance; neither pool can hold it, and now you're playing Tetris to make things work. It's a slightly different argument from bandwidth, and this problem doesn't exist with a unified setup. So if you need to make space, the question is how, and what performance impact the system will suffer for making space. Or you can forgo it and try to load more things just in time (which brings us back to the argument around why the Velocity Architecture is needed).
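Spelled out, the failure mode in that example looks like this (hypothetical numbers straight from the post, nothing platform specific):

#include <cstddef>
#include <cstdio>

int main() {
    // 200 MB free in each pool, but the next asset batch needs 250 MB.
    const std::size_t freeFast = 200, freeSlow = 200, request = 250;  // MB

    const bool fitsUnified = (freeFast + freeSlow) >= request;               // 400 >= 250
    const bool fitsSplit   = (freeFast >= request) || (freeSlow >= request); // neither pool

    std::printf("unified: %s, split pools: %s\n",
                fitsUnified ? "fits" : "doesn't fit",
                fitsSplit   ? "fits" : "doesn't fit");
    return 0;
}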
 
I mean, I guess I'm just thinking about space fragmentation in memory: you've got 200 MB left in each pool but the next set of assets requires 250 MB, for instance; neither pool can hold it, and now you're playing Tetris to make things work.
It's also a bit of a head scratcher to work out whether things will get better or worse with the partial asset technologies in the new consoles. They should ease the absolute pressure on memory, but they almost certainly complicate predicting how much memory may be required at any given instant, because you're no longer loading an entire texture, except in some cases you are. It's all variable.
 
It's also a bit of a head scratcher to work out whether things will get better or worse with the partial asset technologies in the new consoles. They should ease the absolute pressure on memory, but they almost certainly complicate predicting how much memory may be required at any given instant, because you're no longer loading an entire texture, except in some cases you are. It's all variable.
Yea, it seems like a tradeoff for sure, but it likely puts a big stress on I/O calls, etc.
IIRC the UE5 demo only had a streaming pool of 768 MB. That's quite small, all things considered. That means it's really relying on the SSD to drive the data in, which would increase the traffic and the pressure on the code to continually check that it's bringing in the right assets.
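A toy version of what a budget that small forces the streaming code to do (my own sketch; the 768 MB figure is the one quoted for the UE5 demo, everything else is invented):

#include <cstdint>
#include <deque>

// Fixed streaming budget: before requesting new pages/mips from the SSD,
// evict the oldest resident chunks until the new data fits.
struct StreamingPool {
    std::uint64_t budget   = 768ull * 1024 * 1024;  // 768 MB, as quoted for the UE5 demo
    std::uint64_t resident = 0;
    std::deque<std::uint64_t> chunks;               // sizes of resident chunks, oldest first

    bool request(std::uint64_t bytes) {
        if (bytes > budget) return false;           // could never fit
        while (resident + bytes > budget && !chunks.empty()) {
            resident -= chunks.front();             // evict the oldest chunk
            chunks.pop_front();
        }
        resident += bytes;                          // pretend the SSD read completed
        chunks.push_back(bytes);
        return true;
    }
};

The smaller the budget, the more often request() evicts and re-reads, which is exactly the extra SSD traffic and bookkeeping pressure described above.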
 
I mean, I guess I'm just thinking about space fragmentation in memory: you've got 200 MB left in each pool but the next set of assets requires 250 MB, for instance; neither pool can hold it, and now you're playing Tetris to make things work. It's a slightly different argument from bandwidth, and this problem doesn't exist with a unified setup. So if you need to make space, the question is how, and what performance impact the system will suffer for making space. Or you can forgo it and try to load more things just in time (which brings us back to the argument around why the Velocity Architecture is needed).

Yes, I understand that. But I assume there isn't a game engine out there at this point that doesn't have a system to manage data streaming and defragmentation. To adapt it to a two-speed pool, you set it up so it pushes fast data toward the fast edge and slow data toward the slow edge, and let the middle fall where it may. In theory it's simple, and of course in practice it's full of gotchas, but it's probably not much more of a headache than it already is.
 
Not to sidetrack the ongoing conversation, but the Velocity Architecture is really holding up well. In Hitman 3 the load times are roughly 7 seconds on both the XSX and the PS5, and this isn't even a BC game but a real native cross-gen game. Could it be better DMA controllers and file I/O libraries in the XSX? The PS5 should be loading twice as fast. I don't think it's simply because of the higher-clocked CPU in the Series X, but if anyone knows better they could chime in.
 
Not to sidetrack the ongoing conversation, but the Velocity Architecture is really holding up well. In Hitman 3 the load times are roughly 7 seconds on both the XSX and the PS5, and this isn't even a BC game but a real native cross-gen game. Could it be better DMA controllers and file I/O libraries in the XSX? The PS5 should be loading twice as fast. I don't think it's simply because of the higher-clocked CPU in the Series X, but if anyone knows better they could chime in.

Maybe it's just not fully optimized. Loading is slow in Hitman 3, around 6 to 7 seconds, compared to Spider-Man: Miles Morales, for example, where it is around 2 seconds. In an optimized title on PS5, the CPU is not involved in I/O at all.
 