Current Generation Hardware Speculation with a Technical Spin [post GDC 2020] [XBSX, PS5]

iroboto · Mar 31, 2020

Metal_Spirit said:
No... but apparently i'm having trouble passing my message.
So, i´ll leave it be... Since I'm not managing to explain myself, I dont´want to pass the ideia that i'm trying to find problems on any system.

I'll just remind you guys of this case:

https://www.anandtech.com/show/6159/the-geforce-gtx-660-ti-review/2

hmm. I would wait for a deep dive before you jump on this line of thought. You don't know how their memory controllers work, how many they have, and how their interweaving works out.
You are looking at an absolute worst case and making this comparison assuming the memory controller won't fill the remaining 192 GB/s with other data.

But lets do this anyway for the sake of clarity. I will work out your scenario as to what should be happening at a simplistic, amateur but granular level.

Anyone feel free to correct me here; lots of senior members around lately.
Lets assume a best case scenario for PS5 that data needed is 512 bits.

XSX
1st clock cycle: 320 bits off the first 10 chips
2nd clock cycle: 192 bits off the remaining 6 GB chips and the other lanes are wasted
Total = 512 bits pulled in 2 clock cycles.

PS5:
2 clock cycles for PS5:
It will grab 256
Then another 256 bits
Total = 512 Bits

This is of course, if we assume the memory controller setup such that it will _not_ fill the extra lanes and in a situation where you have _2_ devices contending for memory. Then both are exactly equal in the worst case scenario you speak of.

But lets look at another case then:
If data was sized and spread in such a way that it was exactly 40 bytes or exactly 320 bits, or the best case scenario for XSX.

XSX will grab all this data in 1 clock cycle
PS5 will need 2 cycles to do this and on the second cycle it wastes the remaining chip lanes for the request.

Lets look at a real example then.
4KB or 40,960 bits. This is a standard hard drive block.
This divides perfectly into the 10x32 bit bus and it will access the memory all 10 chips every time in full 32 bit blocks. This is the case for anything in multiple of 4KB.

40960 bits is respectively:
XSX: 128 clock cycles
PS5: 160 clock cycles

They are the same memory speeds. So there are no additional differences here. So now XSX can start processing another request 32 clock cycles before PS5 completes.

What about 1024 KB then?
XSX: 32768 clock cycles
PS5: 40960 clock cycles

Alright so from this we see that if the memory is not full, XSX can go much faster than PS5.
So lets look at the worst case scenario, all 16 GB is full. Lets have them race to offload all 16 GB.
Well earlier I showed you that it would take 2 clock cycles for XSX to take 320 bits off the first 10 GB chips followed by a second clock cycle to take off 192 bits off the remaining 6 GB chips.

In 2 clock cycles this equated PS5.

So that means in a race, for the first 6 GB of 16GB, they will be equal in the number of clock cycles to clear out 6 GB.
The remaining 10 GB, XSX will blast through it before PS5.

And there you have your scenario played out. And this is why we don't average speeds.

McHuj · Mar 31, 2020

iroboto said:
This is of course, if we assume the memory controller setup such that it will _not_ fill the extra lanes and in a situation where you have _2_ devices contending for memory. Then both are exactly equal in the worst case scenario you speak of.

I would be really curious to learn how the XSX memory controller works and manages memory transactions on this 320-bit bus. Every thing I've ever seen (at least the hardware I worked with), bus width were always powers of 2 and cache line fetches broke down nicely into them to avoid partial transactions.

Are GPU transactions just big? Does a GPU fetch multiple lines in one transaction? Or can the controller coalesce transactions in some efficient manner to minimize overhead and dead cycles.

Gubbi · Mar 31, 2020

It's not one big 320 bit bus, it's either 5 independent 64 bit busses or 10 32 bit ones.

Cheers

Metal_Spirit · Mar 31, 2020

iroboto said:
hmm. I would wait for a deep dive before you jump on this line of thought. You don't know how their memory controllers work, how many they have, and how their interweaving works out.
You are looking at an absolute worst case and making this comparison assuming the memory controller won't fill the remaining 192 GB/s with other data.

But lets do this anyway for the sake of clarity. I will work out your scenario as to what should be happening at a granular level.

Anyone feel free to correct me here; lots of senior members around lately.
Lets assume a best case scenario for PS5 that data needed is 512 bits.

XSX
1st clock cycle: 320 bits off the first 10 chips
2nd clock cycle: 192 bits off the remaining 6 GB chips and the other lanes are wasted
Total = 512 bits pulled in 2 clock cycles.

PS5:
2 clock cycles for PS5:
It will grab 256
Then another 256 bits
Total = 512 Bits

This is of course, if we assume the memory controller setup such that it will _not_ fill the extra lanes and in a situation where you have _2_ devices contending for memory. Then both are exactly equal in the worst case scenario you speak of.

But lets look at another case then:
If data was sized and spread in such a way that it was exactly 40 bytes or exactly 320 bits, or the best case scenario for XSX.

XSX will grab all this data in 1 clock cycle
PS5 will need 2 cycles to do this and on the second cycle it wastes the remaining chip lanes for the request.

Lets look at a real example then.
4KB or 40,960 bits. This is a standard hard drive block.
This divides perfectly into the 320 bit bus and it will access the memory all 10 chips every time in full 32 bit blocks. This is the case for anything in multiple of 4KB.

40960 bits is respectively:
XSX: 128 clock cycles
PS5: 160 clock cycles

They are the same memory speeds. So there are no additional differences here. So now XSX can start processing another request 32 clock cycles before PS5 completes.

What about 1024 KB then?
XSX: 32768 clock cycles
PS5: 40960 clock cycles

Let me review your cases, because it seems to me you are useing best case scenarios on Xbox on all examples.

"Lets assume a best case scenario for PS5 that data needed is 512 bits.

XSX
1st clock cycle: 320 bits off the first 10 chips
2nd clock cycle: 192 bits off the remaining 6 GB chips and the other lanes are wasted
Total = 512 bits pulled in 2 clock cycles.

PS5:
2 clock cycles for PS5:
It will grab 256
Then another 256 bits
Total = 512 Bits"

Aren´t you assuming a perfect data disposition on the chips?
What if the data needed for the second cycle was on the 10 MB?
Xbox would be wasting remaining lanes.

Now, let's assume all data is on the 6GB slow memory. It would require 3 cycles and some waste also!

"If data was sized and spread in such a way that it was exactly 40 bytes or exactly 320 bits, or the best case scenario for XSX.

XSX will grab all this data in 1 clock cycle
PS5 will need 2 cycles to do this and on the second cycle it wastes the remaining chip lanes for the request."

What if that data is fragmented on both memories? You would require 2 cycles
Or if it is all on the slower memory? 2 cycles plus waste.

"Lets look at a real example then.
4KB or 40,960 bits. This is a standard hard drive block.
This divides perfectly into the 320 bit bus and it will access the memory all 10 chips every time in full 32 bit blocks. This is the case for anything in multiple of 4KB.

40960 bits is respectively:
XSX: 128 clock cycles
PS5: 160 clock cycles"

Another best case scenario for Xbox... Is it not?
You are assuming you can read all from the 10 GB But we are talking 4K. What if those 4K are in slower memory? 214 cycles for Xbox!

I think you got my point!

Regardless I never introduced PS5 to the equation. Never compared performance of both! I was just talking that those 560 GB/s on Xbox X. like Tflops, are not telling everything. And that penalties can occur creating penalties to performance. If I ever mentioned PS5 was due to the fact Sony's console does not present these problems. Nothing else.

But yes... I'm comparing this with the Geforce case... It can happen!

McHuj · Mar 31, 2020

Gubbi said:
It's not one big 320 bit bus, it's either 5 independent 64 bit busses or 10 32 bit ones.

Cheers

Interesting. I didn't think of it that way. I always assumed its one big bus vs a collection of smaller buses.

iroboto · Mar 31, 2020

Gubbi said:
It's not one big 320 bit bus, it's either 5 independent 64 bit busses or 10 32 bit ones.

Cheers

Its 10x32 bit.
Wait, nvm, it may not be.

3dilettante · Mar 31, 2020

mrcorbo said:
So, if there were requests from the CPU to data contained in the 6 chips that make up the slow pool and requests from the GPU for data contained in the 4 chips not included in the slow pool, could these theoretically be fulfilled concurrently?

Usually, these accesses are in the form of a cache line or two, depending on the specifics of the line length. GCN had a mixture of 32B and 64B lines, x86 is usually 64B, and RDNA is a mix of 64B and 128B. DRAM pages are on the order of several KB, with GDDR6 being 2KB from what I've searched up.
The chips have 2 independent channels each, and the optimal case is for a transaction to be satisfied by one channel/controller.

While there are instances of IO or some CPU work that might not align well, the common case is that DRAM can provide 32B per burst, and likes it if you can access the next KB or so before moving along. The caches expect some small multiple of 32B or 64B for their transactions, and things like the GPU's overall rasterization pipeline exist to match up well with DRAM and cache alignment. Hence why techniques that are less coherent tend to do poorly, and why it's not that easy to replace the hardware that matches the memory so well.

iroboto said:
XSX
1st clock cycle: 320 bits off the first 10 chips
2nd clock cycle: 192 bits off the remaining 6 GB chips and the other lanes are wasted
Total = 512 bits pulled in 2 clock cycles.

This generally isn't going to happen.
A GDDR6 DRAM data payload comes over a channel over a burst of 16 clocks. That is usually half of a cache line (meaning next DRAM transaction is very predictable), and almost all accesses are going to be going through the cached memory pipeline.
An attempt to pull just the first 320 bits over the whole system leaves 15 subsequent bus cycles where a transaction is either reading the data or wasting 15/16 of the bandwidth.

Each GDDR6 chip has two independent 16-bit channels (subject to undisclosed details of AMD's hardware), and each channel gets half the chip's physical capacity. The DRAM arrays encourage linear accesses and hitting open pages as much as possible, because there are very large overheads related to changing banks or changing the activity over the bus.

zupallinere · Mar 31, 2020

MrFox said:
Wear is proportional to capacity. Twice the flash size can do twice the write volume with wear levelling. So it's would be no different than having 4 or 6 additional chips for the OS.

This would be a small win maybe for the PS5 since you could get a larger SSD to even things out. I doubt it is worth the cost early on but if you wanted to be on the safe side.

MrFox · Mar 31, 2020

zupallinere said:
So no major thermal win with a more efficient power supply then. Dunno how effecient the average console PSU would be but 8 to 10% more might not get you much.
Yeah material and design/manufacturing costs and it has to be pretty quiet. I was going to say maybe the cooling doesn't have to be expensive as long as you move the air quickly enough but then that is noise. Maybe the did something there ?
The PS5 might be an intriguingly baffling system.

Higher efficiency and extremely compact PSUs go up in price very quickly because they use some really expensive parts to get low switching losses and very high switching frequencies. So it looks like it's never been worth any effort to improve this on stationary devices, it just costs more for not much gain on a home console.

But I love the new stuff coming into market which could help make so crazy small PSUs. The second generation of GaN FETs could make a 500W PSU the size of a wallwart.

iroboto · Mar 31, 2020

3dilettante said:
This generally isn't going to happen.
A GDDR6 DRAM data payload comes over a channel over a burst of 16 clocks. That is usually half of a cache line (meaning next DRAM transaction is very predictable), and almost all accesses are going to be going through the cached memory pipeline.
An attempt to pull just the first 320 bits over the whole system leaves 15 subsequent bus cycles where a transaction is either reading the data or wasting 15/16 of the bandwidth.

Each GDDR6 chip has two independent 16-bit channels (subject to undisclosed details of AMD's hardware), and each channel gets half the chip's physical capacity. The DRAM arrays encourage linear accesses and hitting open pages as much as possible, because there are very large overheads related to changing banks or changing the activity over the bus.

Thanks,

What would be a more basic measure, a 1/2 cache line?
How does 4KB, 1024KB, and 1024MB get divided among the memory banks (typical scenario)? Will it divide it over all the chips, or toss it into a single chip?

zupallinere · Mar 31, 2020

MrFox said:
But I love the new stuff coming into market which could help make so crazy small PSUs. The second generation of GaN FETs could make a 500W PSU the size of a wallwart.

Yeah Anker went big on that early on and it looks nice.

Scott_Arm · Mar 31, 2020

@Metal_Spirit This is a console with a low level API. They'll be able to design their memory layout to mitigate any issues vs an older PC GPU that was using a high-level API that did not necessarily have an explicit allocation of VRAM.

Scott_Arm · Mar 31, 2020

iroboto said:
Thanks,

What would be a more basic measure, a 1/2 cache line?
How does 4KB, 1024KB, and 1024MB get divided among the memory banks (typical scenario)? Will it divide it over all the chips, or toss it into a single chip?

If your read is small than 1 cache line, you basically waste bandwidth. If a cache line is 64B and you read 32B, you waste 32B and get half the effective bandwidth. That's my understanding of it.

Edit: Well, I guess it's more complicated. If the next read you want is the next 32B, then you've already cached it and it's not wasted. But if the next 32B is irrelevant data, then you wasted half the cache line and you halved your bandwidth. CPU may be different than GPU.

iroboto · Mar 31, 2020

Scott_Arm said:
If your read is small than 1 cache line, you basically waste bandwidth. If a cache line is 64B and you read 32B, you waste 32B and get half the effective bandwidth. That's my understanding of it.

That makes sense. Yea it's just too small of an amount.
You'd have to read something larger.

3dilettante · Mar 31, 2020

iroboto said:
Thanks,

What would be a more basic measure, a 1/2 cache line?
How does 4KB, 1024KB, and 1024MB get divided among the memory banks (typical scenario)? Will it divide it over all the chips, or toss it into a single chip?

Cache transactions are the size of their lines, so that's a good base unit.
How pages are striped over channels varies. GPUs can stripe data so that each channel gets 128 bytes, or that was the case for some older APUs. The granularity for the ROPs, texture units, pixel quads, and rasterizer at the size of popular formats likely encouraged this. If the desire is to have as much parallel access to memory as possible, a pixel export that spits out hundreds of bytes isn't served by sending all that traffic into a single destination.
CPU preferences go the other way, where there's not as much bandwidth, but the CPU wants to avoid costly latency penalties if it needs to change DRAM pages. At the same time, if there are multiple NUMA nodes, striping across nodes can either balance utilization or choke a high-bandwidth application.

There's no single right answer, so it comes down to what the system wants to optimize for.

iroboto · Mar 31, 2020

3dilettante said:
There's no single right answer, so it comes down to what the system wants to optimize for.

You're a beast.
You are:

Okay, so at this point in time; there's no way to tell what MS did with their memory controller setup is what you're saying. So without bench-marking, we really don't know how it's going to perform or even behave.

iroboto · Mar 31, 2020

Metal_Spirit said:
Regardless I never introduced PS5 to the equation. Never compared performance of both!

There are multiple forum posts and videos going around the interwebs about this with your exact axioms. I figured I would address the claim that PS5 had more bandwidth than XSX through averaging. The idea of a slow and fast pool of memory is equally aggravating. Its the same damn clockrate and bus width. 6 chips take longer to fill up than the other 4. There is no fast and slow though.

Sorry though, didn't mean to imply.

RobertR1 · Mar 31, 2020

We’re gonna need to be able to run pc synthetics on these consoles to get the shitpost value to sky rocket.

3dilettante · Mar 31, 2020

iroboto said:
You're a beast.

Okay, so at this point in time; there's no way to tell what MS did with their memory controller setup is what you're saying. So without bench-marking, we really don't know how it's going to perform or even behave.

I think it's possible that the GPU-optimized portion could stripe data differently than the standard portion of memory, since the GPU-optimized portion is supposed to give the GPU as much bandwidth as possible, and striping can give the GPU more opportunities to generate parallel traffic.
NUMA considerations aren't really a concern in a single-chip system, so I think that won't be something they'll optimize towards.

I think it'd be fine to expect the system to do well in utilizing bandwidth in the GPU-optimized memory, since the GPU is so important to the console's purpose. AMD has indicated it's improved how well its memory controllers balance CPU and GPU traffic, or at least I hope they have since 2013.

Rockster · Mar 31, 2020

I'm honestly confused by the notion that there is a different performance expectation due to asymmetrical chip densities. If it were to use all 2gb chips, the arbitration of CPU and GPU access by the memory controller would be no different other than managing it over the entire address space as opposed to a portion of it.

Current Generation Hardware Speculation with a Technical Spin [post GDC 2020] [XBSX, PS5]

iroboto

Daft Funk

McHuj

Gubbi

Metal_Spirit

McHuj

iroboto

Daft Funk

3dilettante

zupallinere

MrFox

Deludedly Fantastic

iroboto

Daft Funk

zupallinere

Scott_Arm

Scott_Arm

iroboto

Daft Funk

3dilettante

iroboto

Daft Funk

iroboto

Daft Funk

RobertR1

Pro

3dilettante

Rockster

Similar threads