Current Generation Hardware Speculation with a Technical Spin [post GDC 2020] [XBSX, PS5]

jlippo · Mar 31, 2020

iroboto said:
Sure and what happens when you're not looking at oblique textures? Have you done the math on what it takes to render 16K textures? Or you just assuming it's capable because I/O isn't the limiter anymore?

Can someone else provide some thoughts here? I don't want to be dismissive, but I have found that texture performance for a game is mainly a factor of I/O, Memory Capacity and available Memory bandwidth - which the latter 2 will be severely lacking with respect to the boost we got in I/O

Some metrics on movie scene quality here:
source: http://theillusionden.blogspot.com/2016/03/
Film characters and models can have 8k+ texture maps or even hundreds of 4-8k maps per model

“Almost every asset rendered by Weta for Avatar was painted to some extent in MARI. A typical character was around 150 to 170 patches, with 30 or more channels (specular, diffuse, sub-surface, etc) and 500k polygons at Sub-division level one. The full texture set ran to several tens of gigabytes, all of which could be loaded in MARI at the same time. The biggest asset I saw being painted was the shuttle, which came in at 30Gb per channel for the fine displacement detail (500, 4K textures). Assets of over 20M polys can be painted.” - Jack Greasley

You think PS5 is capable of this in real time because of their SSD?

SSD without proper streaming wouldn't. (Loading highest 8k mipmap for an arrowhead on faraway shelf.)

A proper virtual texturing and such streaming methods with SSD and it should be fine.
A viewport size and amount of different texture layers will more or less determine the size of texture atlas needed and it's memory use is constant.

Afterall huge textures are rarely completely in view and there is always places where mipmaps can be used.
Biggest texture in Rage was 128kx128k and it managed to run on X360/Ps3 although it understandably had some serious texture popping, due to drive performance.

chris1515 · Mar 31, 2020

Silent_Buddha said:
That doesn't make sense as even when occupying the slower portion of memory, that bandwidth is still significantly higher than what a desktop CPU has access to.

There's really no reason to use any of the "fast" memory allocation for the OS.

Regards,
SB

But each time you use the slower memory you diminish the total amount of bandwidth. This an example, If you use half a second the slow memory(336 GB/s) and half a second the fast memory(560 GB/s) at the end you have 448 GB/s of bandwidth.

I understand that OS functionnality which are often accessed can be in the fast memory to have better bandwidth and functionnality rarely used in the slower memory.

EDIT: I was like you saying why use fast memory for OS this is stupid but at the end the answer is logic.

Silent_Buddha · Mar 31, 2020

chris1515 said:
But each time you use the slower memory you diminish the total amount of bandwidth. This an example, If you use half a second the slow memory(336 GB/s) and half a second the fast memory(560 GB/s) at the end you have 448 GB/s of bandwidth.

I understand that OS functionnality which are often accessed can be in the fast memory to have better bandwidth and functionnality rarely used in the slower memory.

EDIT: I was like you saying why use fast memory for OS this is stupid but at the end the answer is logic.

So, MS are lying when they say the full 10 GB of "fast" memory allocation is for games?

I still see no valid reason why the OS at any point would require more than the bandwidth available for the "slow" memory allocation. I'd be surprised if at any point the OS would use more than 100 GB/s of memory bandwidth, much less 500+.

Ryzen 7 3700x and 3900x can't even read from memory that quickly (10's of GB/s not 100's). Does that mean that Windows PCs using those CPUs have slow OS response? Is the OS on XBSX going to be doing something that is impossible on PC? Again we're talking about the OS here and not games. My expectations is that OS will be doing significantly less than a PC OS. It's not like someone will be opening up a large Photoshop project on XBSX or doing massive database searches (in an Azure datacenter perhaps, but not in XBSX).

Regards,
SB

chris1515 · Mar 31, 2020

Silent_Buddha said:
So, MS are lying when they say the full 10 GB of "fast" memory allocation is for games?

I still see no valid reason why the OS at any point would require more than the bandwidth available for the "slow" memory allocation. I'd be surprised if at any point the OS would use more than 100 GB/s of memory bandwidth, much less 500+.

Ryzen 7 3700x and 3900x can't even read from memory that quickly (10's of GB/s not 100's). Does that mean that Windows PCs using those CPUs have slow OS response? Is the OS on XBSX going to be doing something that is impossible on PC? Again we're talking about the OS here and not games. My expectations is that OS will be doing significantly less than a PC OS. It's not like someone will be opening up a large Photoshop project on XBSX or doing massive database searches (in an Azure datacenter perhaps, but not in XBSX).

Regards,
SB

It is a balance if OS is fully in slower RAM, you lose more bandwidh

https://www.resetera.com/threads/pl...ve-ot-secret-agent-cerny.175780/post-30333499

The CPU and GPU share the same bus to the same pool of RAM on both the Series X and PS5. There's no way around that. Only when the CPU is doing literally nothing can the GPU utilize the full bandwidth on either one (except it won't have any work to do because the CPU is what queues up work, so that's quite literally never going to happen.)

As I stated above, if you assume the CPU needs the same 48GB/s on both, then you have 400GB/s remaining on the PS5 or about 39.1GB/s/TF for the 10.23TF PS5 (which is a pretty meaningless measure but it's what you're considering here.) On the Series X you have 480GB/s left or about 39.7GB/s/TF for the 12.1 TG CPU. They're in pretty much the same shape at these rates.

The reason why 48GB/s of bus traffic for the CPU on Series X costs you 80GB/s of the theoretical peak GPU bandwidth is because it's tying up the whole 320-bit bus to transmit only 192 bits of data per cycle from the slower portion of RAM we've been told will be typically used by the CPU. 48GB / 192 bits * 320 bits = 80GB of effective bandwidth used to make that 48GB/s available. This is because only six of the ten RAM chips can contribute to that additional 6GB over and above the 10GB of RAM that can be accessed more quickly when all ten are used in parallel.

This is not the number of GB/s, this is how often you need to access the data. Maybe everything from OS is in slow memory but it cost more bandwidth. No solution is perfect.

Lurkmass · Mar 31, 2020

iroboto said:
CUDA has dominated this industry, making nvidia the main card of choice as the result. I hope that we see these libraries expand to OpenCL.

CUDA dominated because of failed politics inside the Khronos Group who are the same people that standardized OpenCL so don't hope for much since most vendors don't actually care about it.

Many of us also have Macs and others etc, I would like to consider buying a AMD GPU if it supports more ML libraries without the pain (or I guess I could i get to trying to code my own algorithms)

It's not like Mac users have any say when Apple deprecated support for Nvidia hardware so their unsupported GPUs are effectively paperweight unless they use bootcamp. Also it depends on what exactly you mean by "ML libraries" since inferencing can be done on any hardware with the likes of Tensorflow Lite.

If you're expecting to train some models with full featured Tensorflow or PyTorch then no amount of coding on Mac will get you anywhere since they lack the proper APIs to do this. Tensorflow's CUDA kernels uses C++ templates which is a feature that's not available on Apple's latest and greatest Metal API. There's even a ROCm port of Tensorflow to run on AMD hardware which uses their HIP API instead of OpenCL to solve the lack of C++ features as commonly seen on many PC APIs.

Proelite · Mar 31, 2020

https://www.eurogamer.net/articles/digitalfoundry-2020-inside-xbox-series-x-full-specs

In terms of how the memory is allocated, games get a total of 13.5GB in total, which encompasses all 10GB of GPU optimal memory and 3.5GB of standard memory. This leaves 2.5GB of GDDR6 memory from the slower pool for the operating system and the front-end shell. From Microsoft's perspective, it is still a unified memory system, even if performance can vary. "In conversations with developers, it's typically easy for games to more than fill up their standard memory quota with CPU, audio data, stack data, and executable data, script data, and developers like such a trade-off when it gives them more potential bandwidth," says Goossen.

chris1515 · Mar 31, 2020

Proelite said:
https://www.eurogamer.net/articles/digitalfoundry-2020-inside-xbox-series-x-full-specs

At least it is clear all CPU call in slower memory, only lowering bandwidth and not place. This is an ok compromise.

BRiT · Mar 31, 2020

Sheesh, people trying to make up shit for page hits dont even do basic home work to get things correct when they've already been clearly stated by Microsoft and DigitalFoundry.

We even had it listed in the system reservations thread with a direct link to the source. Now it's been quoted inlined explicitly.

iroboto · Mar 31, 2020

Lurkmass said:
CUDA dominated because of failed politics inside the Khronos Group who are the same people that standardized OpenCL so don't hope for much since most vendors don't actually care about it.

It's not like Mac users have any say when Apple deprecated support for Nvidia hardware so their unsupported GPUs are effectively paperweight unless they use bootcamp. Also it depends on what exactly you mean by "ML libraries" since inferencing can be done on any hardware with the likes of Tensorflow Lite.

If you're expecting to train some models with full featured Tensorflow or PyTorch then no amount of coding on Mac will get you anywhere since they lack the proper APIs to do this. Tensorflow's CUDA kernels uses C++ templates which is a feature that's not available on Apple's latest and greatest Metal API. There's even a ROCm port of Tensorflow to run on AMD hardware which uses their HIP API instead of OpenCL to solve the lack of C++ features as commonly seen on many PC APIs.

yea I get that. It’s not something that’s going to happen over night. But someone is buying AMD hardware; over time with a large enough population ideally people may build more libraries for it.

as for now; yes I’m stuck on nvidia if I want to stay with high level libraries that have GPU support.

Deleted member 13524 · Mar 31, 2020

Kaotik said:
(as side note, I know everyone keeps referring to them as 64-bit memory controllers, but are they really that? At least certain AMD slides suggest they're actually 16-bit controllers, which also fits the fact that GDDR6 uses 16-bit channels (other option would be 64-bit split into 4x16 "virtual" memory controllers but why list them as 16 separate MCs (for 256bit controller) then?)

When you look at chip x-rays from older AMD APUs/GPUs, it looks like the PHYs they use are all 64bit wide. Or maybe they're all 32-bit ones placed in pairs, side-by-side.
GDDR6 could change that, though.

manux said:
PS5 should have fixed power budget which sony can optimize against. The remaining question is ambient temperature. For ambient temp sony can probably design against specific maximum temp and just say console will not work in hotter than let's say 40C.

Is 40ºC even a realistic proposition? Who's playing videogames at that room temperature?

RobertR1 said:
AVX has been around since Sandy Bridge and is used in PC games already. It's not a new thing.

AVX256 only came with Haswell I think.

RobertR1 said:
AVX512 doesn't exist on mainstream CPU's.

Ice Lake U is pretty mainstream IMO, though I don't even know if it's usable (the CPU-Z AVX512 test crashes my Ice Lake laptop..)

TheAlSpark said:
This indicates 12MB for the 4900HS.
https://www.notebookcheck.net/AMD-Ryzen-9-4900HS-Processor-Benchmarks-and-Specs.454860.0.html

Weird. AMD site indicates 8MB.
https://www.amd.com/en/products/apu/amd-ryzen-9-4900hs

Whoops! Maybe they accidentally added the L2 cache (4MB).

Aah, AMD's infamous Game Cache!

VitaminB6 said:
That's the problem, what is most of the time? I know the majority of time spent on my console is probably watching Hulu/Netflix .

I don't think Cerny would make a presentation that addresses game developers where he claims the typical clocks are reached when the console is running non-gaming apps.
Besides, most likely the CPU and GPU will lower its clocks and voltage (and disable CUs since the APU has that ability) like hell when watching Netflix.

ultragpu said:
During a multiplatform gameplay using XSX rendering at native 4k for comparison, if PS5 stays at 1800p most of the time then it must mean the gpu clock is heavily dropped and 9.2TF is most likely the standard number since there's a 44% pixel difference between 2160p and 1800p. If it stays at ~2000p then it would be less than 20% in pixel difference and Cerny would be correct after all.

Assuming this multiplatform game will be running the exact same game (shader complexity, asset size, etc.).
Which may not always be the case, especially with the large advantage in I/O performance on one side, and memory bandwidth advantage on the other.

function said:
It's certainly possible for PS5 to lack VRS - these are customised solutions. For example, MS added int8 and int4 to their GPU. There's also the matter of licensing. If RNDA2 uses a MS patented implementation of VRS, it's possible that won't translate to PS5.

VRS next gen is likely to be as widespread as depending on SSDs to stream the games' assets.
nvidia and Intel graphics have had it for a while, and RDNA1 is the odd duck here. Most probably, VRS is part of RDNA2 and Microsoft is implementing their tweaked, customized version that perhaps offers better performance and/or is more flexible than AMD's.

shiznit said:
If fixed clocks of 2.0 GPU and 3.0 CPU were hitting a power wall, how is it possible for the variable approach to constantly keep both near 2.23 and 3.5? Doesn't add up.
(...)
I get what you're saying but it still doesn't make sense to me, unless you're referring to absolute worst corner case. Say clocks of 2.0/3.0 were consuming 200W under typical load, how is 98% of 2.23/3.5 going to consume less? They are claiming to be able to sustain higher clocks than the fixed version.

Silenti said:
What happens in 3 years time when developers are really starting to push the consoles and that "load" goes up?

It seems there's this idea that pushing the hardware to provide better visuals is inherently going to push more power from the chip, which isn't true.

Someone here already mentioned furmark, and that's a great example, along with e.g. OCCT.
Nowadays graphics card drivers will automatically limit the clocks if certain loads (like furmark) are detected, yet even then the card consumes more power.
For example, a 2080 Ti that reaches 1950MHz in an Unity graphics benchmark is limited to 1500MHz if it's running furmark, and in both cases the driver is always trying to touch the TDP limit by boosting the clocks as much as possible. And even at the downclocked 1500MHz frequency the card is running hotter than if it was running regular gaming loads.

It's not like furmark looks good. The code is above 15 years old AFAIK. It's just a type of code that cycles a lot through certain a part of the rendering pipeline, making the chip's power consumption and heat output reach very high levels.

BRiT said:
That would be absolutely horrible for early adopters especially if a devkit offered better performance than retail units, so the devs think they're offering a great experience but a large subset of their consumers suffer.

So you're saying the One X devkits with extra 4 CUs enabled are horrible?

dskneo said:
The Mhz number is no longer what determines performance for PS5.

Was it ever for anything else?

iroboto · Mar 31, 2020

jlippo said:
SSD without proper streaming wouldn't. (Loading highest 8k mipmap for an arrowhead on faraway shelf.)

A proper virtual texturing and such streaming methods with SSD and it should be fine.
A viewport size and amount of different texture layers will more or less determine the size of texture atlas needed and it's memory use is constant.

Afterall huge textures are rarely completely in view and there is always places where mipmaps can be used.
Biggest texture in Rage was 128kx128k and it managed to run on X360/Ps3 although it understandably had some serious texture popping, due to drive performance.

Yea I get that. I guess I'm thinking about whether it's possible to have high enough texture resolution that the texture is never stretched and is 1x1 with your native resolution at closest camera for all objects. Yes you're only loading in portions of it with virtual texturing. And yes I'm sure we can find a way to load in and out things effectively but can we do that and you still have the benefits of the fast loading with the benefits of the 'instant turn around load everything' with the no pop-in. And the super speed etc.

It just seems like, if the only limitation to texture size is the footprint it leaves on the hard drive, I'm surprised that this wasn't resolved a long time ago.

I get where textures are a big part of graphics, the more detail they contain the better everything looks. It's just the way it is.

RobertR1 · Mar 31, 2020

ToTTenTranz said:
Ice Lake U is pretty mainstream IMO, though I don't even know if it's usable (the CPU-Z AVX512 test crashes my Ice Lake laptop..)

I don't stay current on mobile arch's but yeah it seems to

You can run Prime 95 29.8 or latest AIDA 64 (fpu only) if you want to test AVX512. CPU-Z is trash tier as a stress test.

Metal_Spirit · Mar 31, 2020

[

chris1515 said:
It is a balance if OS is fully in slower RAM, you lose more bandwidh

https://www.resetera.com/threads/pl...ve-ot-secret-agent-cerny.175780/post-30333499

This is not the number of GB/s, this is how often you need to access the data. Maybe everything from OS is in slow memory but it cost more bandwidth. No solution is perfect.

Not shure I get what you guys are talking about. But it seems to me Xbox series X does have indeed a bootleneck in bandwidth if both pools of memory are accessed at the same time.

Problem lies within the memory configurations. The Xbox has 10 memory chips, 4 with 1 GB, and 6 with 2 GB. To get two pools one with 10 GB at 560 GB/s (320 bits) and one with 6 GB with 336 GB/s (192 bits), the memory disposition must be 4x1 GB modules accessed at 32 bits, and the first 1 GB of the six 2 GB modules also accessed at 32 bits.
This with 5x64 bits controllers gives you 320 bits access to all these chips, each provinding 56 GB/s, so 10 chips with 1 GB equals 10 GB at 560 GB/s.
Now for the other pool, you need to access the extra 1 GB on the 2 GB modules. Since each is connected with a 32 bits bus, and there are 6 modules, thats a 192 bits bus... and that equates to 6 GB at 336 GB/s.

Big problem is that you are quantifying the same 32 bits channel on the 2 GB modules on both pools: if that's ok to quantify maximum bandwidth on each of the pools, it doesn´t work like that in reality, since it´s the same bus on both.And if you are using the 32 bits on one pool, you cannot be using the same 32 bits channel on the other.

So to access both pools at full 32 bits the simple choice is to do it in alternate clock cycles. This is about the same as reducing the bus width to 16 bits for each pool, and acessing both at the same time.

Since the 1 GB modules are free from this, they will still provide 224 GB/s in total. But the 2 GB modules will provide half, reducing the 10 GB pool memory bandwidth to 392 GB/s, and the 6 GB one to 168 GB/s.

I really don´t know how this can be solved... Any ideas?

iroboto · Mar 31, 2020

Metal_Spirit said:
[

Not shure I get what you guys are talking about. But it seems to me Xbox series X does have indeed a bootleneck in bandwidth if both pools of memory are accessed at the same time.

Problem lies within the memory configurations. The Xbox has 10 memory chips, 4 with 1 GB, and 6 with 2 GB. To get two pools one with 10 GB at 560 GB/s (320 bits) and one with 6 GB with 336 GB/s (192 bits), the memory disposition must be 4x1 GB modules accessed at 32 bits, and the first 1 GB of the six 2 GB modules also accessed at 32 bits.
This with 5x64 bits controllers gives you 320 bits access to all these chips, each provinding 56 GB/s, so 10 chips with 1 GB equals 10 GB at 560 GB/s.
Now for the other pool, you need to access the extra 1 GB on the 2 GB modules. Since each is connected with a 32 bits bus, and there are 6 modules, thats a 192 bits bus... and that equates to 6 GB at 336 GB/s.

Big problem is that you are quantifying the same 32 bits channel on the 2 GB modules on both pools: if that's ok to quantify maximum bandwidth on each of the pools, it doesn´t work like that in reality, since it´s the same bus on both.And if you are using the 32 bits on one pool, you cannot be using the same 32 bits channel on the other.

So to access both pools at full 32 bits the simple choice is to do it in alternate clock cycles. This is about the same as reducing the bus width to 16 bits for each pool, and acessing both at the same time.

Since the 1 GB modules are free from this, they will still provide 224 GB/s in total. But the 2 GB modules will provide half, reducing the 10 GB pool memory bandwidth to 392 GB/s, and the 6 GB one to 168 GB/s.

I really don´t know how this can be solved... Any ideas?

This is the first I've ever heard of a 192 bit bus. You would draw lanes to both chips but certainly not 16 bits to each chip. Why would you do that?

Microsoft's solution for the memory sub-system saw it deliver a curious 320-bit interface, with ten 14gbps GDDR6 modules on the mainboard - six 2GB and four 1GB chips. How this all splits out for the developer is fascinating.

"Memory performance is asymmetrical - it's not something we could have done with the PC," explains Andrew Goossen "10 gigabytes of physical memory [runs at] 560GB/s. We call this GPU optimal memory. Six gigabytes [runs at] 336GB/s. We call this standard memory. GPU optimal and standard offer identical performance for CPU audio and file IO. The only hardware component that sees a difference in the GPU."

Nothing here indicates that the speeds of the memory change at all.

I thought this was straight forward:
Same pool of memory, different chip sizes. Slow and fast is really more like do you want to pull from 10 or 6.

There are 10 chips in total, each chip has 56 GB/s bandwidth on a 320 bit bus.
56 * 10 = 560 GB/s
Bandwidth is the total size of the pipe, and in this case it will be about the total amount of pull you can grab at once.
Of the 10 chips, 6 of them are 2 GB in size.
56 * 6 = 366 GB/s

If you data is on the 2nd half of the 1GB chip you will get 366 GB/s because you only have 6 chips to pull that data from. I don't care how data is stored on the 2GB chips, it will always be able to return 32 bits of data per clock cycle. Whether it's 32 bits to each GB, or 16 bits to both and the data is split. Whatever the case is, it's returning 32 bits through the main controller every single time.

But you still have 4 bus openings on the remaining 4 chips, available, just because it's accessing the back half of those 2GB chips, doesn't mean it closes off the other remaining lanes.

So you can still pull 56 *4 on the remaining 1 GB chips.
which is 224 GB/s

so adding these together, it is back to 560 GB/s.

There is no averaging of memory
There is no split pool.

You're only downside is if you put _all_ of your data on the 6x2GB chips, you're limited to a bandwidth of 336 GB/s. Because you'll grab the data on 1/2 and then if you need data on the other 1/2, you'll need to alternate. But that can be handled by priority. But that doesn't stop the developers from always fully utilizing all lanes to achieve its 560 GB/s.

Regardless if you are alternating or not, those 6 chips will be constantly giving out 336GB/s.
And regardless if you are alternating or not on the 6x2 GB chips, you still have 4x1 GB chips waiting to go. Giving you a total of 560 GB/s bandwidth whenever all 10 chips are being utilized.

This should not be treated like 2 separate ram pools like they did with 360 or XBO.

Because of the imbalance in CPU to GPU, perhaps you'll just prioritize the GPU.
While I'm not sure how priority works, edit: they'll priortize whatever is going to give the most performance. Anyway all GPU data goes into the single chips. This is an easy one. Remaining GPU data goes into the 1GB of data. The CPU data would sit on top of that, and any sort of GPGPU work that may need to be done.

TLDR; I don't really see an issue here unless you're always got contention on those 6x2 GB chips. You'd still have this probably anyway. No one would be trying this argument if it was 10x2GB chips. You'd still have the same issue if you're trying to access the memory controller to pull the data all on the same chips. It would still be 560GB/s, you just have 20 GB of memory. Would you use your current argument to say that 10x2GB chips has bottlenecked because the data is split over 2GB chips and now it needs to alternate? Or have some sort of custom controller to do this inn which it needs to trade off bandwidth?

tunafish · Mar 31, 2020

iroboto said:
Because of the imbalance in CPU to GPU, you'll just prioritize the GPU.
While I'm not sure how priority works, I suspect because of the imbalance in CPU to GPU, you'll just prioritize the GPU first.

No, in these kind of systems you always prioritize the CPU first, because it gets hit much worse by the added latency of waiting, and because the GPU can trivially use all of the bandwidth pretty much constantly, while the CPU cannot, so if you prioritize the GPU you can end up completely starving the CPU of resources.

iroboto · Mar 31, 2020

tunafish said:
No, in these kind of systems you always prioritize the CPU first, because it gets hit much worse by the added latency of waiting.

yea perhaps. You might be right.
But what if the CPU is sitting around idle most of the time anyway? Would you still prioritize it?

Well clearly the prioritization is probably a lot more complex than we are making it lol

tunafish · Mar 31, 2020

iroboto said:
yea perhaps. You might be right.
But what if the CPU is sitting around idle most of the time anyway? Would you still prioritize it?

If it's sitting idle most of the time, it's not consuming a lot of ram bandwidth and therefore having it prioritized doesn't hurt you much.

iroboto · Mar 31, 2020

tunafish said:
If it's sitting idle most of the time, it's not consuming a lot of ram bandwidth and therefore having it prioritized doesn't hurt you much.

That's also true.

3dilettante · Mar 31, 2020

dobwal said:
I have a question. With int4 and int8 support, can that mean that the XSX offers tensor core like performance with those precision modes?

I ask because the RDNA white paper makes reference to Navi CU variants.

AMD has a patent for a parallel matrix multiply pipeline using dot product units.

http://www.freepatentsonline.com/y2019/0171448.html

The Navi operations increase throughput over a conventional FMA by acting like packed math and then allowing the results to be combined into an accumulator. Looking at the patent or how the tensor operations work for Nvidia, it looks like it would be a fraction of what the matrix ops would do. The lane format without a matrix unit would allow those dot operations to generate only the results along a diagonal of that big matrix.
The AMD scheme is more consistent with Vega, as the vector unit is 16-wide, and the new hardware may align with code referencing new instructions and registers for Arcturus. One other indication this is different is that the Navi dot instruction would take up a normal vector instruction slot since it happens in the same block. Arcturus and this matrix method would allow at least some normal vector traffic in parallel.

chris1515 said:
But each time you use the slower memory you diminish the total amount of bandwidth. This an example, If you use half a second the slow memory(336 GB/s) and half a second the fast memory(560 GB/s) at the end you have 448 GB/s of bandwidth.

I understand that OS functionnality which are often accessed can be in the fast memory to have better bandwidth and functionnality rarely used in the slower memory.

EDIT: I was like you saying why use fast memory for OS this is stupid but at the end the answer is logic.

The scenario where the system is spending half a second in the slow pool requires something in the OS, an app, or a game resource put in the slow section needing 168 GB/s of bandwidth.
There is some impact because of the imbalance, but it scales by what percentage of the memory access mixture goes to the slow portion. If a game did that, it would likely be considered a mis-allocation. A background app would likely be inactive or prevented from having anything like that access rate, and the OS gets by with a minority share of the tens of GB/s in normal PCs without issue.
I can see the OS sporadically interacting with shared buffers for signalling purposes or copying data from secured memory to a place where the game can use it, but that's on the the order of things like networking or the less than 10 GB/s disk IO.

Metal_Spirit said:
Not shure I get what you guys are talking about. But it seems to me Xbox series X does have indeed a bootleneck in bandwidth if both pools of memory are accessed at the same time.

If the GDDR6 chips were all the same capacity, there would still be a "pool" for the OS and apps, since accesses for them wouldn't be going to the game. The individual controllers would see some percentage of accesses going to them that the game wouldn't be able to use. Let's say 1% goes the OS, or 5.6GB/s. The game experiences a bandwidth bottleneck if it needs something like 555 GB/s in that given second. If there's a set of code, sound data, or rarely accessed textures that don't get used in the current game scene unless the user hits a specific button or action, finally hitting that action while the game is going on blocks the other functions' accesses for some number of cycles.
With the non-symmetric layout, the OS or slow pool pushes some of that percentage onto the channels associated with the 2GB chips.
Going by the 1% scenario, the six controllers would need to find room for the 40% of the OS traffic that cannot be striped across the smaller chips, or 40% of 5.6 GB/s. The 336 GB/s pool would be burdened with an extra 2.24GB/s.

Unless something in the slow pool is demanding a significant fraction of the bandwidth, and I don't know what functionality other than the game's renderer that needs bandwidth on that order of magnitude, I can see why Microsoft saw it as a worthwhile trade-off.
If a game put one of its most heavily used buffers in the slow pool, I think the assumption is that the developers would find a way to move it out of there or scale it back so it'd fit elsewhere.

mrcorbo · Mar 31, 2020

Metal_Spirit said:
So to access both pools at full 32 bits the simple choice is to do it in alternate clock cycles.

There's no reason to think they are doing this. When you access an address in slow RAM (or it's the CPU/IO), you do it at 192-bits. When you access an address in fast RAM (and it's the GPU), it's done at 320-bit. There's no reason why you'd have to alternate cycles. Which pool you were accessing would be totally dictated by client (GPU,GPU or I/O) demand.

Current Generation Hardware Speculation with a Technical Spin [post GDC 2020] [XBSX, PS5]

jlippo

chris1515

Silent_Buddha

chris1515

Lurkmass

Proelite

chris1515

BRiT

(>• •)>⌐■-■ (⌐■-■)

iroboto

Daft Funk

Deleted member 13524

Guest

iroboto

Daft Funk

RobertR1

Pro

Metal_Spirit

iroboto

Daft Funk

tunafish

iroboto

Daft Funk

tunafish

iroboto

Daft Funk

3dilettante

mrcorbo

Foo Fighter

Similar threads