Technical Comparison Sony PS4 and Microsoft Xbox

blakjedi · May 27, 2013

Betanumerical said:
Yes the GPU can read from virtually any cache (seems to be a big thing in both systems), but it doesnt mean its quick. The coherent bus is only 30GB/s. You'd be better off doing what Gubbi said and writing to it from the GPU's output. Reading from the CPU caches you get a max of 20GB/s.

Its only 32mb. Lol

Betanumerical · May 27, 2013

blakjedi said:
Its only 32mb. Lol

memory speeds still matter as it takes time for the data to be transfered, and using the north bridge you have a access to a tiny susbet of memory (the caches) at about 5MB so i guess atleast you dont need to care about the whole 32MB being filled.

blakjedi · May 27, 2013

How many threads per clock per shader core does ps4 execute? Vgleaks says 32 but everyone else seems to calculate using 64.

Any clarification there?

Betanumerical · May 27, 2013

blakjedi said:
How many threads per clock per shader core does ps4 execute? Vgleaks says 32 but everyone else seems to calculate using 64.

Any clarification there?

PS4/XBONE have the same base architecture so they both execute the same amount of threads per clock cycle / CU.

warb · May 27, 2013

I'm still not sure how the GPU reading from both DDR3 and eSRAM, and writing to either would cost significanty more bandwidth than doing the same in GDDR5.

Like how is the GPU reading from DDR3 and writing to eSRAM (or vice versa) more costly than read/write to GDDR5? Assuming there aren't 32mb blocks of data being moved around willy-nilly. Does it make a difference if the framebuffer lives in DDR3.

french toast · May 27, 2013

Betanumerical said:
PS4/XBONE have the same base architecture so they both execute the same amount of threads per clock cycle / CU.

So What effect do the 8 ACEs have on ps4 gpu?

blakjedi · May 27, 2013

Betanumerical said:
PS4/XBONE have the same base architecture so they both execute the same amount of threads per clock cycle / CU.

Do you have a source?

Scott_Arm · May 27, 2013

blakjedi said:
Do you have a source?

AMD GCN whitepaper?

Betanumerical · May 27, 2013

french toast said:
So What effect do the 8 ACEs have on ps4 gpu?

My (probably incorrect) understanding is its like a ROB in a CPU, the PS4 has more queues therefore it could do more fine grained stuff / have more data stream to choose from when the time comes but thats about it.

blakjedi said:
Do you have a source?

Well vgleaks for one gets the exact same numbers for both of the GPU's as you would out of standard GCN so I think thats evidence enough.

blakjedi · May 27, 2013

Betanumerical said:
memory speeds still matter as it takes time for the data to be transfered, and using the north bridge you have a access to a tiny susbet of memory (the caches) at about 5MB so i guess atleast you dont need to care about the whole 32MB being filled.

No I mentioned "only 32mb" because even at 60 frames, the bandwidth cost would max at 1.92 g/s, call it 2. Lets say the process is only 80% efficient so you only have 24gb/s bandwidth total to start with. Halve that to account for read or write, leaving 12gb/s for all gpu reads from that bus (gpu, esram, dma/move engines). You could still read six times that data rate into esram per second. If you halve the amount of changing data to 16mb your bandwidth usage decreases by roughly half. And you still have the entire 8g of ddr at68gb/s to play in.

I just do not think your bandwidth usage scenario is accurate.

Betanumerical · May 27, 2013

blakjedi said:
No I mentioned "only 32mb" because even at 60 frames, the bandwidth cost would max at 1.92 g/s, call it 2. Lets say the process is only 80% efficient so you only have 24gb/s bandwidth total to start with. Halve that to account for read or write, leaving 12gb/s for all gpu reads from that bus (gpu, esram, dma/move engines). You could still read six times that data rate into esram per second. If you halve the amount of changing data to 16mb your bandwidth usage decreases by roughly half. And you still have the entire 8g of ddr at68gb/s to play in.

I just do not think your bandwidth usage scenario is accurate.

But the problem is the majority of your GPU assets arent going to be in your CPU cache, so your not going to be able to access most of them, you need to use the DDR3 for that you could access maybe a tiny subset of your over all assets through the coherent bus.

blakjedi · May 27, 2013

Betanumerical said:
My (probably incorrect) understanding is its like a ROB in a CPU, the PS4 has more queues therefore it could do more fine grained stuff / have more data stream to choose from when the time comes but thats about it.

Well vgleaks for one gets the exact same numbers for both of the GPU's as you would out of standard GCN so I think thats evidence enough.

When I looked at vgleaks I didn't see that number. I found 32. That said it could be 32 64bit threads per alu which is necessarily half of the 32 bit values.

Betanumerical · May 27, 2013

blakjedi said:
When I looked at vgleaks I didn't see that number. I found 32. That said it be 32 64bit threads per alu which is necessarily half of the 32 bit values.

If you look here, youll see that they actually mean that per SC.

12 SCs * 4 SIMDs * 16 threads/clock = 768 ops/clock

As 4 SIMD's per SC, 16 threads a SIMD, 64 threads/clock per SC.

Hell even the picture from vgleaks suggests its 64 threads/clock per SC with 4 16 wide SIMD's per SC.

We have been over this so many times on the forum, they are both practically standard GCN XBONE is not `4x the power per SC`.

blakjedi · May 27, 2013

Betanumerical said:
But the problem is the majority of your GPU assets arent going to be in your CPU cache, so your not going to be able to access most of them, you need to use the DDR3 for that you could access maybe a tiny subset of your over all assets through the coherent bus.

I'm clearly a newb

Could you explain how the assets get from storage to gpu please? The diagrams are probably confusing me. From what I see on both diagrams all data whether compute or graphical passes through the Northbridge from storage. The gpu memory controller has access to that data through a 68gb/s path through the ddr3 and through the Northbridge at 30gb/s. There is nothing I have seen that indicates that those are not parallel accesses so both pathways must be included in the bandwidth estimate.

The nb connection directly to storage and all other non ddr caches is the key to feeding the esram for specific workloads.

blakjedi · May 27, 2013

Betanumerical said:
If you look here, youll see that they actually mean that per SC.

As 4 SIMD's per SC, 16 threads a SIMD, 64 threads/clock per SC.

Hell even the picture from vgleaks suggests its 64 threads/clock per SC with 4 16 wide SIMD's per SC.

We have been over this so many times on the forum, they are both practically standard GCN XBONE is not `4x the power per SC`.

Actually its not 16 threads per SIMD. Its 64 threads per SIMD but only one SIMD is "active" per clock. The scheduler in each sc picks one of ten wavefronts per cycle and assigns that work to a single SIMD. Per Gubbi and the leak, an instruction takes 4 clocks to complete so each SIMD in each Shafer core gets a single 64 thread workload once every four clocks.

Betanumerical · May 27, 2013

blakjedi said:
I'm clearly a newb

Could you explain how the assets get from storage to gpu please? The diagrams are probably confusing me. From what I see on both diagrams all data whether compute or graphical passes through the Northbridge from storage. The gpu memory controller has access to that data through a 68gb/s path through the ddr3 and through the Northbridge at 30gb/s. There is nothing I have seen that indicates that those are not parallel accesses so both pathways must be included in the bandwidth estimate.

The nb connection directly to storage and all other non ddr caches is the key to feeding the esram for specific workloads.

don't feel bad though every makes mistakes.

The problem with feeding directly from storage is that HDDs have both a horrible read speed and a horrible latency in comparasion the various type of memory laying around, a normal SATA3 HDD probably peaks at less then 1GB/s with latency ranging into the milliseconds where as all the memory buses in the system are atleast 20x faster then that and have a much better latency.

Yes both of the 30GB/s and 68GB/s and 102GB/s are what microsoft use in there bandwith estimation.

They are parallel acccess but it seems to me that you would use the coherent bus to snoop the CPU's cache if you were trying to do something that uses tighter CPU <-> GPU integration then you get on a normal desktop.

blakjedi said:
Actually its not 16 threads per SIMD. Its 64 threads per SIMD but only one SIMD is "active" per clock. The scheduler in each sc picks one of ten wavefronts per cycle and assigns that work to a single SIMD. Per Gubbi and the leak, an instruction takes 4 clocks to complete so each SIMD in each Shafer core gets a single 64 thread workload once every four clocks.

Everyone makes mistake see.

warb · May 27, 2013

Betanumerical said:
But the problem is the majority of your GPU assets arent going to be in your CPU cache, so your not going to be able to access most of them, you need to use the DDR3 for that you could access maybe a tiny subset of your over all assets through the coherent bus.

I'm probably missing much of the picture but why is using DDR3 for that a problem, don't you need to access the same GPU assets from GDDR5 similarly?

Betanumerical · May 27, 2013

warb said:
I'm probably missing much of the picture but why is using DDR3 for that a problem, don't you need to access the same GPU assets from GDDR5 similarly?

Its not a problem. Im just explaining to him why you cannot expect to have the majority of your data over the coherent bus.

blakjedi · May 27, 2013

Betanumerical said:
Its not a problem. Im just explaining to him why you cannot expect to have the majority of your data over the coherent bus.

But I hope you do understand where I am coming from during the process of our discussions. The 30gb/s with 102 on the ESRAM isn't for nought and can be independent of the DDR3 bandwidth costs.

Of course you would have to manage your I/O appropriately to account for the ESRAM - Gubbi gave very good examples of where they are useful. The GPU is obviously by necessity I/O incoherent to ensure that it has full control over flushing its caches. Which may or may not be another plus when you think about.

Any developer that could manage their loads/assets properly for the ESRAM and DDR will have very little problem porting to PS4.

Over the week I will probably spend more time on the PS4 architecture. This weekend was my first opportunity in months to actively look at all this stuff.

Betanumerical · May 27, 2013

blakjedi said:
But I hope you do understand where I am coming from during the process of our discussions. The 30gb/s with 102 on the ESRAM isn't for nought and can be independent of the DDR3 bandwidth costs.

Of course you would have to manage your I/O appropriately to account for the ESRAM - Gubbi gave very good examples of where they are useful. The GPU is obviously by necessity I/O incoherent to ensure that it has full control over flushing its caches. Which may or may not be another plus when you think about.

Yes the coherent link is seperate from the DDR3, but seriously, how many of your assets are going to fit into the CPU caches, and how many do you even want in there?, a cache miss for the CPU is not good, you wouldnt just willy nilly put your GPU assets into the CPU cache so you could use the coherent bus. This is the same reason why we wouldn't count the extra 20GB/s that the coherent bus in the PS4 gives us.

Technical Comparison Sony PS4 and Microsoft Xbox

Similar threads