Technical Comparison: Sony PS4 and Microsoft Xbox One

How much of a difference does using PRTs make to the picture, when you take into account that the texture caches aren't infinite in size and the 32MB of ESRAM is accessible to the XB1 GPU at lower latency?

Would the 32MB hold enough texture data to make a difference, or would it be fetching from DDR3 so often in a 'modern game' that the latency advantage would be a moot point?

There must be SOME advantage to make you want to take a 1.6B transistor hit, over and above "it should work out in the long term..."?

Well, one advantage of the eSRAM is that DDR3 is dirt cheap in comparison to GDDR5. I'm not smart enough to know the advantages of the low latency in most situations, nor do I know the average size of textures per frame in a game.

But if we are going to start working out that kind of stuff, we should at least take some bandwidth away for the CPUs.

So we have 68GB/s - 10GB/s (for the 8 CPUs) = 58GB/s

58/30 = 1.93GB/f
1.93GB/f * 1024 = 1976.32MB/f
1976.32MB/f / 32MB = 61.76 fills/frame

Half that for 60FPS games, of course.
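
The same arithmetic as a quick Python sketch (the 10GB/s CPU reserve is the guess from above, not a published figure):

DDR3_BW_GBPS = 68        # DDR3 peak bandwidth
CPU_RESERVE_GBPS = 10    # assumed CPU share of that bandwidth
FPS = 30
ESRAM_MB = 32

gpu_mb_per_frame = (DDR3_BW_GBPS - CPU_RESERVE_GBPS) / FPS * 1024
fills_per_frame = gpu_mb_per_frame / ESRAM_MB
print(f"{gpu_mb_per_frame:.2f} MB/f -> {fills_per_frame:.2f} full ESRAM fills per frame")
# ~1979.7 MB/f -> ~61.9 fills; the working above rounds 58/30 to 1.93 first,
# hence its slightly lower 61.76.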
 
The more you want to use the eSRAM, the more DDR3 bandwidth will be taken up by copying. Overall you'll get a net benefit, but it shows why it's not really fair to just add up the bandwidths when you have to copy from one to the other.

For example for a 30FPS game

The DDR3 bandwidth per frame is 68/30 = 2.27GB/f

Assuming you want to use some amount of data in the eSRAM, each full write to the eSRAM (all 32MB) will take 32MB/frame away from the DDR3,

so if you fill the eSRAM completely (all 32MB) 10 times a frame, your DDR3 bandwidth for other stuff is:

68/30 = 2.27GB/f = 2324.48MB/f
2324.48MB/f - (32 * 10)MB/f = 2324.48MB/f - 320MB/f = 2004.48MB/f
Or back to GB/f: 1.95GB/f

As Brad Grenz pointed out below, it also uses up eSRAM bandwidth too.

To write:

102/30 = 3.4GB/f = 3481.6MB/f
3481.6MB/f - (32 * 10)MB/f = 3481.6MB/f - 320MB/f = 3161.6MB/f
Or back to GB/f: 3.1GB/f

To then read [as Brad Grenz pointed out again; I feel stupid today]:

3161.6MB/f - 320MB/f = 2841.6MB/f

Total bandwidth left over after just copying 320MB, with no other reading/writing by the GPU or CPU:

2.8GB/f + 1.95GB/f = 4.75GB/f

And that's with doing 0 operations on the actual data.

In comparison, assuming the same thing for the PS4:

176/30 = 5.86GB/f = 6000MB/f
6000MB/f - 320MB/f = 5680MB/f
Back to GB/f: 5.55GB/f
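
Putting the whole copy-cost model into a small Python sketch (same assumed figures: 68GB/s DDR3, 102GB/s eSRAM each way, 176GB/s GDDR5, ten full 32MB fills per frame at 30FPS; rounding differs slightly from the hand working above):

FPS = 30
COPY_MB = 32 * 10  # ten full 32MB eSRAM fills staged per frame

def leftover_mb(bw_gbps, cost_mb):
    """Per-frame bandwidth in MB left after subtracting a copy cost."""
    return bw_gbps / FPS * 1024 - cost_mb

ddr3_left = leftover_mb(68, COPY_MB)              # DDR3 pays the read side of the copy
esram_left = leftover_mb(102, COPY_MB) - COPY_MB  # eSRAM pays the write, then the read-back
print(f"XB1: {ddr3_left / 1024:.2f} GB/f DDR3 + {esram_left / 1024:.2f} GB/f eSRAM "
      f"= {(ddr3_left + esram_left) / 1024:.2f} GB/f")
print(f"PS4: {leftover_mb(176, COPY_MB) / 1024:.2f} GB/f GDDR5")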

Nvm, I saw it was addressed. Honestly, there isn't any real performance advantage to the Xbox's setup, but I'd expect it to be cheaper, especially since the eSRAM can be shrunk as MS goes along.
 
I can think of so many games that are unplayable on 1GB of VRAM at 1080p...
That might be true for a few games, but the majority of games run just fine on a 1 GB card at 1080p when the default (medium) quality settings are used (no AA, no AF). Of course, if you start ramping up the antialiasing and/or detail levels, 1 GB cards start to stutter badly in the most demanding games. However, these 1 GB cards aren't powerful enough to run those games at ultra settings (or at high resolutions / antialiasing levels) anyway, so the limited memory doesn't hurt that much in reality.

Next gen will likely change the situation a bit. I expect 2 GB cards to become the minimum requirement to run games at console texture+AA quality at 1080p. But developers will still need to design their games to degrade gracefully on 1 GB cards. It would be insane to throw away 70% of the existing PC market.
A 7870 has exactly the same bandwidth as a 7850 and higher ALU, texture and fillrate throughput, yet it still posts higher results in every game you throw at it. That suggests to me that very few games at 1080p are bandwidth bound, or else the results would fail to scale.
Yes, it seems that many games are either TEX or ALU bound on the 7850 (not bandwidth bound). I don't think the 7850 is fill bound, since it has 72% more fill than the 7770, but only a very few games show improvements of that magnitude. The 7870, on the other hand, seems to be a very well balanced card (the extra TEX and ALU seem to be helping it in many recent games).
 
OK,
Thinking about DDR3 + eSRAM vs GDDR5 bandwidth and game engines...

So I am possibly very wrong here, and I'm hoping someone can point me in the right direction, but isn't a typical transaction going to be more like...

PS4 / standard PC setup
1. GPU needs texture
2. GPU gets texture @ correct mipmap from memory
3. GPU needs next mipmap of same texture
4. GPU goes to GDDR5 again for the next mipmap
Repeat for every texel...

XB1
1. GPU needs texture
2. Texture + all mipmaps get copied to eSRAM (or already exist there from a previous frame)
3. GPU gets texture from eSRAM
4. GPU needs next mipmap of same texture
5. GPU gets next mipmap directly from eSRAM
Repeat for every texel...

The idea being that steps 3, 4 and 5 are MUCH faster on the XB1, so that overcomes the extra time that step 2 takes on the XB1.
This allows more operations/transfers to happen at once, making better use of the available system resources.
Also, this completely ignores the more obvious benefit of keeping your working set of render targets in a local cache (i.e. the eSRAM).
Of course, more render targets = less space for a texture cache...
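
For what it's worth, here's a toy Python model of that idea; the latency numbers are pure placeholders, not published figures for either console:

DDR3_LATENCY_NS = 150   # made-up miss-to-DRAM latency
ESRAM_LATENCY_NS = 40   # made-up on-die eSRAM latency

def avg_fetch_ns(esram_hit_rate):
    """Average texel-fetch latency for a given eSRAM hit rate."""
    return (esram_hit_rate * ESRAM_LATENCY_NS
            + (1 - esram_hit_rate) * DDR3_LATENCY_NS)

for hit_rate in (0.0, 0.5, 0.9):
    print(f"eSRAM hit rate {hit_rate:.0%}: ~{avg_fetch_ns(hit_rate):.0f} ns per fetch")

Though, as the replies below point out, GPUs hide most of that latency by keeping many wavefronts in flight, so bandwidth rather than raw latency is usually the limiter.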

I'm talking way over my head here, but I would be really interested if anyone could provide some more info on likely rendering optimisations for a rendering engine with a large local cache.

Cheers,
 

Faster in what way, though? Latency-wise, sure, but bandwidth-wise it wouldn't be any faster. And do GPUs really only read in a single mipmap at a time?
 
How much of a difference does using PRTs make to the picture, when you take into account that the texture caches aren't infinite in size and the 32MB of ESRAM is accessible to the XB1 GPU at lower latency?

Each SIMD in a GCN CU can have 10 different wavefronts in flight, and it takes four cycles to complete one instruction. That's a total of 40 cycles of latency tolerance, or 50ns. Hitting in the caches is thus critical for performance.
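
The arithmetic behind that, in Python (the ~800MHz clock is an assumption chosen to match the 50ns figure):

WAVEFRONTS_PER_SIMD = 10   # per GCN SIMD
CYCLES_PER_INSTRUCTION = 4
GPU_CLOCK_GHZ = 0.8        # assumed clock

cycles = WAVEFRONTS_PER_SIMD * CYCLES_PER_INSTRUCTION
print(f"{cycles} cycles = {cycles / GPU_CLOCK_GHZ:.0f} ns of latency tolerance")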

Worst case is a workload where all wavefronts miss the texture caches. If they sample the texture atlas and miss, they almost certainly miss the actual tile pointed to by the atlas as well.

The ESRAM isn't of infinite size either, so I expect XB1 developers to work hard to fit the PRT atlas and active tiles into the part of the ESRAM not used by rendering buffers (let's say half, 16MB).

Cheers
 
One of the things that "bothers" me, and that I would like to read more about, is the software side of things, or more precisely how some techniques would map to the different memory configurations.

Let's put aside for a moment the differences in raw throughput (different numbers of CUs, ROPs, etc.) and assume these two set-ups:
Systems 1 & 2: X CPU cores, Y GPU "cores", Z ROPs

System 1 is UMA like the PS4; system 2 is like Durango, with the matching bandwidth figures.

This gen we've seen a couple of developer teams vouch for tight G-buffers, though I don't know the pros and cons of a large versus a tight G-buffer.

How could things like in-order transparency affect the memory footprint in realtime graphics?
It is a bit unclear to me what A-buffers (accumulation?) are, but the same question applies to them.

Overall I'm less concerned by the difference in raw throughput than by the "level" of freedom offered by the two systems (not a statement, more a question).
 
[Quoting the bandwidth calculation from earlier in the thread: ten full 32MB eSRAM fills per frame at 30FPS leave roughly 1.95GB/f of DDR3 plus 2.8GB/f of eSRAM bandwidth on the XB1, about 4.75GB/f in total, versus about 5.55GB/f on the PS4.]


Still missing something, I believe.

Assuming that you read each entire 32MB dataset 3 times, you're only getting 32*10*3 = 960MB of real bandwidth available each frame.
Upping it to each dataset being read 5 times, we get 1600MB/frame.
(If you don't use a dataset multiple times, you might as well NOT read it into the eSRAM, for bandwidth purposes.)

Adding that to the DDR3, which provides 3.08GB/frame after reading ten 32MB datasets into the eSRAM, we will only have 4.04GB/frame for 3 full reads and 4.68GB/frame for 5 full reads in the best of situations, which I'm sure is utterly impossible to do.

This also ignores the amount of scheduling you'd have to do to allow each dataset to get its 3-5 full reads before being replaced by another set.
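
The reuse point restated in Python (fills per frame and reuse counts are the assumptions from this post):

ESRAM_MB = 32
FILLS_PER_FRAME = 10

for reuse in (1, 3, 5):
    useful_mb = ESRAM_MB * FILLS_PER_FRAME * reuse
    print(f"each dataset read {reuse}x: {useful_mb} MB/frame of useful eSRAM reads")
# At 1x reuse you'd have been better off reading straight from DDR3: the copy
# already cost a DDR3 read plus an eSRAM write for every byte.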
 
Assuming you want to use some amount of data in the eSRAM, each full write to the eSRAM (all 32MB) will take 32MB/frame away from the DDR3

Please explain to me why anybody would copy data to ESRAM unless it is a win (i.e. data reuse > 2).

Or is your basic assumption that MS and all XB1 developers are idiots?

Cheers
 

All it's meant to show is that it's disingenuous to add up the bandwidth numbers and call it a day, because every time you copy something into the eSRAM it takes away bandwidth that would otherwise be available from the DDR3, and the copy uses 3x the bandwidth that would be required if just reading from the DDR3.

Of course it's going to be a win and the data will be read/written, but that doesn't stop it from taking bandwidth from the DDR3.

It's not meant to show anything other than that.
 
Please explain to me why anybody would copy data to ESRAM unless it is a win (i.e. data reuse > 2).

Or is your basic assumption that MS and all XB1 developers are idiots?

Cheers

There might be an interesting tidbit in that statement... unless I'm reading too much into it!

Extra work is required from devs to make the most of the ESRAM, then? It's not an 'automagic' performance-boosting feature?

Will MS be in the position that PS3 was in last generation? An opportunity for developers to make use of an interesting and potentially powerful hardware feature (as with the SPUs, but to a lesser extent), but with the need to spend time on optimisation to get the best from the hardware?

That approach didn't help Sony much last time round... especially with 3rd party developers.
 
[Image: durango_memory.jpg — Durango memory system diagram from vgleaks]


Looking at the diagram from vgleaks, I have a hard time squaring Brad's and beta's descriptions of ESRAM activity and bandwidth usage. The ESRAM is local to the GPU and not sitting within the general memory subsystem paths at all. There is no system bandwidth usage incurred while using the ESRAM; it doesn't have to copy anything through main system RAM.

The GPU gets 170GB/s of read, which is parallel read from the ESRAM and the DDR3, and 102GB/s of write from the GPU.

I'm not seeing what you guys are saying at all, since access to both memory pools is simultaneous, parallel, asynchronous and non-linear. You don't have to copy anything into DDR3 memory first and then into the ESRAM to work on it. You can write directly to the ESRAM, the DDR3, or both, and read at the full data rate of each path. Can anyone clarify?
 

You move data from the DDR3 to the eSRAM. As you can see, the only fast bus to the eSRAM from the outside world is the DDR3; it's the only practical way to get data to it.
 
I think multi-platform ports will be very interesting between these two; if a game is made for the PS4's memory layout, I can see it being a complete pain in the ass to get it running at the same speed/quality on the Xbox One.

On the plus side, the Xbox One made me laugh out loud... The PS4, on the other hand... now that machine has my attention... At the end of this year I could end up buying my first console since the PS2 released...
 
No. Northbridge.

The northbridge doesn't have any RAM; it needs to get its data from somewhere, and where would that be? You have 3 choices: the CPU (thus reducing CPU bandwidth and reducing your memory space to about ~8-10MB), the DDR3 (reducing DDR3 bandwidth), or the 9GB/s HDD controller (thus really negating any benefit at all).
 
You move data from the DDR3 to the eSRAM. As you can see, the only fast bus to the eSRAM from the outside world is the DDR3; it's the only practical way to get data to it.

Rendering of shadow buffers is completely GPU<->ESRAM.
Filling lighting/materials/G-buffer means reading textures from DDR3 or ESRAM and writing to ESRAM.
Accumulating light and reading materials/G-buffer is strictly GPU<->ESRAM.

The only time you copy something from DDR3 into ESRAM is when it is a definite win compared to just reading from DDR3.
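
To put numbers on why that working set needs budgeting, a rough footprint check in Python (assuming plain 1080p 32-bit render targets; real formats vary):

WIDTH, HEIGHT, BYTES_PER_PIXEL = 1920, 1080, 4
ESRAM_MB = 32

rt_mb = WIDTH * HEIGHT * BYTES_PER_PIXEL / (1024 * 1024)
print(f"one 1080p 32bpp target: {rt_mb:.2f} MB; "
      f"{ESRAM_MB} MB of ESRAM holds about {int(ESRAM_MB / rt_mb)} of them")

At ~7.91MB each, roughly four such targets fill the ESRAM before you even get to depth, shadow maps or PRT tiles.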

Cheers
 

Yep, that all makes sense, but you can still only read from the DDR3 at 68GB/s when doing that, unless the data is already in the eSRAM. And either way, be it from the GPU ROPs or from the northbridge, it'd have to get there via another bus (the ROPs being relatively free). So the whole eSRAM + DDR3 bandwidth addition is still bunk, even more so when you think about size.
 
The northbridge doesn't have any RAM; it needs to get its data from somewhere, and where would that be? You have 3 choices: the CPU (thus reducing CPU bandwidth and reducing your memory space to about ~8-10MB), the DDR3 (reducing DDR3 bandwidth), or the 9GB/s HDD controller (thus really negating any benefit at all).

The GPU can read from virtually any cache anywhere in the system. Reading/writing directly to/from the ESRAM is the point. You can do that through the graphics controller and the NB, totally skipping the DDR3. Why are you missing that?

I'm probably wrong, but the 30GB/s coherent read/write between the GPU and the northbridge may be the source of the missing 30GB/s of bandwidth talked about by MS and alluded to by either Gubbi or sebbbi, I can't remember which.
 

Yes, the GPU can read from virtually any cache (that seems to be a big thing in both systems), but it doesn't mean it's quick. The coherent bus is only 30GB/s. You'd be better off doing what Gubbi said and writing to it from the GPU's output. Reading from the CPU caches you get a max of 20GB/s.
 