Next Generation Hardware Speculation with a Technical Spin [post GDC 2020] [XBSX, PS5]

Discussion in 'Console Technology' started by Proelite, Mar 16, 2020.

  1. Dictator

    Newcomer

    Joined:
    Feb 11, 2011
    Messages:
    247
    Likes Received:
    939
    100% agree with this - I think decals, layers, and tiling detail textures, plus shader diversity work (a cloth shader next to an illum shader next to a skin shader next to a hair shader next to a fibreglass shader), are going to give that minute detail and differentiation - not huge per-object textures!
     
    BRiT likes this.
  2. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    3,287
    Likes Received:
    3,546
    I think by now we know this kind of memory arrangement has its own major drawbacks, ones that simply outweigh the drawbacks of a split memory pool. You pay for the flexibility of the shared memory pool by having to significantly increase bandwidth; otherwise, CPU/GPU contention will wipe out any advantage the shared arrangement gives you.

    By the same logic, the PS5 also introduces CPU/GPU clock contention; even looking at this superficially, you can tell this kind of system will never be optimal.
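    As a back-of-envelope sketch of that contention cost (a minimal model; the CPU bus share and switching overhead below are assumed figures, not measurements):

    [code]
    # Toy model: if the CPU occupies the shared bus for some fraction of
    # each frame, the GPU's effective bandwidth drops by at least that
    # fraction, and by more if arbitration/switching isn't free.
    PEAK_BW_GBPS = 448.0     # e.g. a 256-bit GDDR6 bus @ 14 Gbps
    cpu_bus_share = 0.10     # assumption: CPU holds the bus 10% of the time
    switch_overhead = 0.05   # assumption: extra loss from switching penalties

    gpu_bw = PEAK_BW_GBPS * (1.0 - cpu_bus_share - switch_overhead)
    print(f"GPU effective bandwidth: {gpu_bw:.0f} GB/s of {PEAK_BW_GBPS:.0f}")
    # -> GPU effective bandwidth: 381 GB/s of 448
    [/code]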
     
    VitaminB6 and PSman1700 like this.
  3. Betanumerical

    Veteran

    Joined:
    Aug 20, 2007
    Messages:
    1,692
    Likes Received:
    168
    Location:
    In the land of the drop bears
    Do we know if this is how it is actually laid out, or is it just set up as a slow and a fast memory space, with striping across different numbers of memory modules? I don't think I saw anything that indicated there was bandwidth or channels dedicated to the GPU and CPU in the XSX.

    Additionally, wouldn't any memory that's accessed during this 'contended' period with only 4 chips only be accessible at 224GB/s?

    I'm not entirely sure what is being suggested here is how things work in the real world. To me it seems like the simpler solution of just giving exclusive access of the bus to a single 'unit' on the APU for the period of the memory transaction would be how it works in practice.
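    For reference, the 224GB/s figure falls straight out of the chip count, assuming the announced 14 Gbps/pin and one 32-bit interface per chip:

    [code]
    # XSX bandwidth scales linearly with how many of the ten GDDR6 chips
    # a transfer can stripe across (14 Gbps/pin, 32 bits per chip).
    GBPS_PER_PIN = 14
    BITS_PER_CHIP = 32

    def bw(chips):
        return chips * BITS_PER_CHIP * GBPS_PER_PIN / 8  # GB/s

    print(bw(10))  # 560.0 - full 320-bit bus, the "GPU optimal" 10 GB
    print(bw(6))   # 336.0 - the slower 6 GB region
    print(bw(4))   # 224.0 - only the four 1 GB chips, as asked above
    [/code]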
     
  4. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    43,577
    Likes Received:
    16,028
    Location:
    Under my bridge
    It's not about flexibility, but capacity vs BW. If you have 8 GBs VRAM and 16 GBs RAM, your drawing either limits itself to 8 GBs VRAM to use its full BW with only 8 GBs of assets (less framebuffers), or stores assets in RAM and can use more than 8 GBs of assets but has to copy them across the slow bus into VRAM to draw. If you have 16 GBs of unified RAM, you have all 16 GBs available for assets but you impact maximum BW.

    More assets, or faster reads and writes? Pick your poison - you can't have both. The PC's choice has nothing to do with it being the better compromise; it's the necessary design to support the open-ended architecture, and it makes up for the crap RAM bandwidth by using great gobs of expensive VRAM as a massive cache for that redundantly duplicated data sitting in both pools.
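    As a quick sketch of that poison pick, with illustrative PC-like numbers (both bandwidth figures are assumptions for the example, not any specific card):

    [code]
    # Touching an asset that lives in system RAM costs a trip over the
    # slow expansion bus first, which is why PCs duplicate data into
    # VRAM as a big cache.
    VRAM_BW = 448.0   # GB/s, assumed GDDR6-class card
    BUS_BW = 16.0     # GB/s, assumed PCIe 3.0 x16-class link

    asset_gb = 2.0
    print(f"Read from VRAM:      {asset_gb / VRAM_BW * 1000:.1f} ms")   # ~4.5 ms
    print(f"Copy over bus first: {asset_gb / BUS_BW * 1000:.1f} ms")    # ~125 ms
    # A unified pool skips the copy entirely, at the price of a lower
    # (shared) peak bandwidth for everything.
    [/code]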
     
    #1624 Shifty Geezer, Apr 1, 2020
    Last edited: Apr 1, 2020
  5. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    43,577
    Likes Received:
    16,028
    Location:
    Under my bridge
    But that doesn't explain (to me anyway!) why the BW drop was far higher than the CPU was using, and why that can't be fixed with a better memory controller. I would have expected (as did everyone else, because the BW drop came as a surprise) that while the CPU was accessing the RAM, the GPU had to wait, but it'd be 1:1 CPU usage to BW impact. What we saw on Liverpool was the RAM losing efficiency somehow, as if there was a switching penalty. I would hope AMD can fix that issue and have a near 1:1 impact on their console UMAs, so 1 ms of full RAM access for the CPU means only 1 ms less available for the GPU and the remaining frame time accessible at full rate.
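    As a toy model of the difference between an ideal 1:1 arbiter and what Liverpool seemed to show (the peak is PS4's announced figure; the CPU residency and penalty are illustrative assumptions):

    [code]
    # Ideal 1:1 arbitration: 1 ms of CPU bus residency per frame costs
    # the GPU exactly that 1 ms of bus time. The Liverpool behaviour
    # looked like an extra efficiency penalty on top.
    FRAME_MS = 16.6
    PEAK = 176.0   # GB/s, PS4's GDDR5 bus

    def gpu_bw(cpu_ms, switch_loss=0.0):
        usable = (FRAME_MS - cpu_ms) / FRAME_MS
        return PEAK * usable * (1.0 - switch_loss)

    print(f"ideal 1:1:               {gpu_bw(1.0):.0f} GB/s")        # ~165
    print(f"with 20% switch penalty: {gpu_bw(1.0, 0.20):.0f} GB/s")  # ~132
    [/code]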
     
  6. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    10,828
    Likes Received:
    10,867
    Location:
    The North
    Yea, some parts weren't clear. Well, @3dilettante's explanation actually helps provide a lot of context for what might be happening with respect to optimizing memory layouts: either for the GPU for maximum bandwidth, or for the CPU for less paging. The more the CPU has to jump around, the longer it takes, perhaps eating into bandwidth further.

    It may be different this time around, 7 years later.
     
  7. psorcerer

    Regular

    Joined:
    Aug 9, 2004
    Messages:
    732
    Likes Received:
    134
    I'm not sure we need to explain it for the 100th time. It was perfectly fine the first 10 times, but now it's getting ridiculous.
    They had problems maintaining 2+3 for the maximum possible load, because that's something nobody does in the hardware world.
    XBSX claims to do it, but that's the claim that should be scrutinized, because it's the unrealistic one, not Sony's.
    Sony's claim is perfectly fine: when power-hungry operations are used too much, the GPU or CPU underclocks. That has been the case for every GPU and CPU till now.
    The novel thing is that they measure the load/power by profiling the instructions in real time.
    And yes, any CPU vendor that states a frequency is doing it on an "average load" basis, not on a "100% max load ever possible" basis, which would be a much, much lower frequency (like the 1.6GHz example for Intel on a 3.5GHz CPU).
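    A minimal sketch of that real-time profiling idea, with a made-up power model and budget just to show the shape of it:

    [code]
    # Clocks track profiled activity against a fixed power budget: hold
    # the max clock for typical loads, shave a few percent only when a
    # pathological instruction mix would exceed the budget.
    POWER_BUDGET_W = 200.0   # assumption, not an announced figure

    def gpu_clock_ghz(activity):             # activity: 0.0 idle .. 1.0 worst case
        est_power = 150.0 + 60.0 * activity  # assumed power model
        if est_power <= POWER_BUDGET_W:
            return 2.23                      # hold the max clock
        return 2.23 * (POWER_BUDGET_W / est_power)  # scale to fit the budget

    print(gpu_clock_ghz(0.5))  # 2.23 - typical load, no downclock
    print(gpu_clock_ghz(1.0))  # ~2.12 - a few percent shaved off
    [/code]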

    That's online lighting calculation. Using textures is a way of trading memory<->ALU. In the end, if your pipeline is some sort of "full" lighting solution (RT/photon mapping/radiosity/etc.) you don't need textures at all.
    You have your materials, and all their properties are calculated in real time.
    Unfortunately the current state of ALU power prohibits such things from running in real time (even Minecraft-like graphics uses textures with RT).
    Therefore pre-baking things is still the most viable solution. Essentially you use ALU for the most dynamic stuff in your frame, or the stuff that's most visible as "dynamic" in the frame, and statically bake all other things.
    On the other hand, there are a lot of other dynamic things that can be pre-baked to great effect.
    For example, the weather. It changes pretty slowly (over ~1000 frames), so you can have essentially a full-blown weather transition with the SSD, where each weather change brings in a whole new set of environment animations/details/decals/textures.
    You can have permanent growth/damage/wear in a lot of places, which was trouble to load in time from an HDD.
    Etc.
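    A rough streaming budget for that ~1000-frame weather transition (the set size is an assumption; the SSD rates are the announced raw figures):

    [code]
    # Swapping in a whole new environment set over ~1000 frames leaves
    # a generous window, so the required rate is modest.
    frames, fps = 1000, 60
    new_set_gb = 8.0   # assumed size of the new environment set

    window_s = frames / fps   # ~16.7 s to hide the transition
    print(f"required: {new_set_gb / window_s:.2f} GB/s")   # ~0.48 GB/s
    # vs PS5's 5.5 GB/s raw and XSX's 2.4 GB/s raw - plenty of headroom
    [/code]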

    Not to mention that these copies are heavily CPU-involved, because the device buffer format is not the same as the host buffer format.
    Which is another consequence of an open-ended architecture.
     
    megre likes this.
  8. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    10,828
    Likes Received:
    10,867
    Location:
    The North
    I no longer think it’s obvious how MS will address this after getting better insight. Yes it would be ideal to pick up the remaining amount instead of casting it aside.
     
  9. Betanumerical

    Veteran

    Joined:
    Aug 20, 2007
    Messages:
    1,692
    Likes Received:
    168
    Location:
    In the land of the drop bears
    I agree it would be ideal to pick up the remaining bandwidth. But if that were the case, wouldn't it imply the existence of three different memory spaces and not two? Those three being:

    1 - Fast space (10/10 memory modules)
    2 - Slower space (6/10 memory modules)
    3 - Parallel access space (4/10 memory modules)

    You'd have to specifically allocate for all three spaces.
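    A capacity sketch of that hypothetical three-space carve-up, given the XSX's mixed chip population (six 2GB chips plus four 1GB chips, 16GB total):

    [code]
    # Hypothetical layout per the post above - not a confirmed design.
    two_gb_chips, one_gb_chips = 6, 4

    fast = (two_gb_chips + one_gb_chips) * 1  # first GB of all 10 chips = 10 GB
    slow = two_gb_chips * 1                   # second GB of the 2 GB chips = 6 GB
    # A third "parallel access" space on the four 1 GB chips would have
    # to steal its capacity out of the fast space above:
    parallel = one_gb_chips * 1               # up to 4 GB, at 224 GB/s

    print(fast, slow, parallel)  # 10 6 4 - three ranges to allocate into
    [/code]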
     
  10. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    10,828
    Likes Received:
    10,867
    Location:
    The North
    I still think it makes sense on console to do it this way. It just saves costs, and you're right that there is bandwidth loss during contention, but when there isn't any, the gains are there. And when contention is present you lose bandwidth, but not so much that it's choking the system.

    Seems a fair trade-off. It's lower, but not too low, and it can reach some good highs.
     
    VitaminB6 likes this.
  11. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    10,828
    Likes Received:
    10,867
    Location:
    The North
    I don't know how many controllers there are, so I guess I'm not sure what is possible. I have pretty good confidence that MS knew the exact performance of this machine before they burnt the silicon, using live code. They should know, as they did with X1X. If this were really causing an issue, we would know by now. But more importantly, if it were really causing an issue and developers didn't like it, converting the 4 single 1GB chips to 2GB chips is still an option. I just don't see this happening.

    MS has been rolling forward with their plans and hitting a cadence here. Observing their entire strategy shows me that everything is going according to their expectations and plans. A concession here would be something they missed, and it would be startling if, after years of 'customization', they suddenly discovered the challenges of asymmetrical memory capacity right before launch - as if MS spent 3 years doing nothing, when they built most of their features and resolved most of their issues during the X1X generation.
     
    blakjedi, PSman1700 and BRiT like this.
  12. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    9,134
    Likes Received:
    3,030
    Location:
    Finland
    There are either 5 controllers with 4×16-bit channels each, 10 32-bit controllers with 2×16-bit channels each, or 20 16-bit controllers. Each GDDR6 chip has two independent 16-bit channels, which the memory controllers need to adhere to.
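    All three arrangements add up to the same 320-bit bus:

    [code]
    # 10 GDDR6 chips x 2 independent 16-bit channels = 320 bits total,
    # however the controllers are grouped.
    configs = {
        "5 controllers x 4 x 16-bit":   5 * 4 * 16,
        "10 controllers x 2 x 16-bit": 10 * 2 * 16,
        "20 controllers x 1 x 16-bit": 20 * 1 * 16,
    }
    for name, bits in configs.items():
        print(name, "=", bits, "bits")  # all print 320
    [/code]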
     
    blakjedi, Shoujoboy, function and 2 others like this.
  13. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    7,837
    Likes Received:
    1,155
    Location:
    Guess...
    The original Xbox was pretty similar.

    RSX was quite comparable to the high-end PC GPUs of the time, being largely equivalent to a GeForce 7800 GTX 256 - a more or less top-end GPU of its day. I'd say it was at least as close to top-end PC GPUs as what the PS4 or PS5 are launching with. It only seemed anemic at the time because it was preceded by Xenos, which was well ahead of its time, and came just before the much more powerful 8800 GTX arrived in the PC market.
     
  14. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    9,134
    Likes Received:
    3,030
    Location:
    Finland
    While it was equivalent to the 7800 GTX in many ways, it had only half the ROPs and half the memory bandwidth (not counting XDR here, since RSX had to access it via the CPU).
     
    PSman1700 likes this.
  15. chris1515

    Veteran Regular

    Joined:
    Jul 24, 2005
    Messages:
    4,786
    Likes Received:
    3,744
    Location:
    Barcelona Spain
    AMD has a patent for improving and mitigating this problem. They simply prioritize CPU memory calls, because the CPU is more sensitive to latency than the GPU. I don't remember where to find the patent, but someone found it on Era. @anexanhume maybe?
     
  16. anexanhume

    Veteran Regular

    Joined:
    Dec 5, 2011
    Messages:
    1,938
    Likes Received:
    1,276
    Yes, there was a patent. It’s going to take a while for me to find it.

    Edit: there was this, but I think there's another that more directly addresses memory use in an HSA system.
    http://www.freepatentsonline.com/y2019/0122417.html
     
    #1636 anexanhume, Apr 1, 2020
    Last edited: Apr 1, 2020
    BRiT likes this.
  17. function

    function None functional
    Legend Veteran

    Joined:
    Mar 27, 2003
    Messages:
    5,345
    Likes Received:
    2,813
    Location:
    Wrong thread
    Maybe ... if you're performing (for example) a single GPU access with data spread across two channels (say a 128-bit vector), and the CPU jumps to the front of the queue with a pesky 32-bit read, and the memory controller can't easily schedule something to do on the other channel with no notice ... the total lost bandwidth to the GPU would be greater than the actual data transferred to the CPU.

    If the GPU likes wide access and the CPU is often doing narrow, CPU access being prioritised could really multiply the losses to the GPU.

    Access patterns and priorities and all that, innit.
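    A toy illustration of that amplification (the per-channel rate and stall window are invented for the example):

    [code]
    # A prioritised 32-bit CPU read that stalls a pair of 16-bit
    # channels for a full burst window costs the GPU far more bytes
    # than the CPU actually moved.
    CHANNEL_BYTES_PER_CYCLE = 4   # assumed, per 16-bit GDDR6 channel
    STALL_CYCLES = 16             # assumed burst/turnaround window

    cpu_bytes_moved = 4                                           # one 32-bit read
    gpu_bytes_lost = 2 * CHANNEL_BYTES_PER_CYCLE * STALL_CYCLES   # both channels idle
    print(f"CPU moved {cpu_bytes_moved} B, GPU lost {gpu_bytes_lost} B")
    print(f"amplification: {gpu_bytes_lost // cpu_bytes_moved}x")  # 32x in this toy case
    [/code]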
     
    BRiT likes this.
  18. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    7,837
    Likes Received:
    1,155
    Location:
    Guess...
    Very true. Although it still likely fared at least as favorably against the 7800 GTX 256 as the PS5 GPU will against the high-end PC GPUs at the time of its launch (i.e. big Navi and GA102).
     
    Kaotik likes this.
  19. Rockster

    Regular

    Joined:
    Nov 5, 2003
    Messages:
    973
    Likes Received:
    129
    Location:
    On my rock
    Huh? There is just one memory space, with a 320-bit interface to the memory controller. The only thing that varies is chip density, so some modules have a greater capacity, or addressable range, than others. The fact that you might request data from a range within one module that extends beyond the available range of another does nothing to prevent all modules from being accessed every cycle. As for how coherent CPU requests impact the GPU, that remains the same regardless.
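    One plausible (not confirmed) interleave consistent with that single-address-space view:

    [code]
    # The first 10 GB stripes across all ten chips; addresses above
    # that stripe across only the six 2 GB chips. Granularity assumed.
    STRIPE = 256   # bytes per stripe, assumption
    GB = 1 << 30

    def chip_for(addr):
        if addr < 10 * GB:                 # "GPU optimal" region
            return (addr // STRIPE) % 10   # one of all ten chips
        rel = addr - 10 * GB               # upper 6 GB region
        return (rel // STRIPE) % 6         # one of the six 2 GB chips

    print(chip_for(0))               # 0
    print(chip_for(10 * GB + 512))   # 2 - lands in the 6-chip stripe
    [/code]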
     
    VitaminB6 and PSman1700 like this.
  20. PSman1700

    Veteran Newcomer

    Joined:
    Mar 22, 2019
    Messages:
    2,703
    Likes Received:
    903
    RSX was basically a generation behind; the 8800 series launched in autumn 2006 (along with the Intel quads). It didn't help that the 8800 was one of the biggest jumps in history.
    One would hope the PS5 GPU fares better than RSX did. But since we're at about 14TF now for GPUs released in 2018 (not counting Titan), the highest-end Navi/Ampere could be close to 20TF, perhaps with HBM on AMD's side. That could be close to double, probably without downclocks.
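    The TF arithmetic behind that comparison (PS5's figures are announced; the 80 CU part is a hypothetical big Navi, not a confirmed spec):

    [code]
    # TFLOPS = 2 ops (FMA) x shader lanes x clock
    def tflops(cus, ghz, lanes_per_cu=64):
        return 2 * cus * lanes_per_cu * ghz / 1000

    print(f"PS5, 36 CU @ 2.23 GHz:        {tflops(36, 2.23):.2f} TF")  # 10.28
    print(f"hypothetical 80 CU @ 1.9 GHz: {tflops(80, 1.90):.2f} TF")  # 19.46
    [/code]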
     