> AFAIK ROP count is the same (64) on both consoles. Series X's run at 1.83GHz, PS5's run at 2.23GHz.

But bandwidth will be the bottleneck here, and PS5 has less of it. So raster performance will likely go to XSX in this case, as it has more.
> Is it really fast enough to do that though? You're looking at around 352 MB/frame at 60Hz.
>
> Edit: And that's assuming that all of the data you need is stored with the absolute maximum compression ratio. And that's a full 16ms read, so it requires at least one frame of buffering just to hide the read.

It's more about the absolute speed at which a player can turn around, so it would load a fraction every frame while turning. The maximum-bandwidth case is a full 180 at the max turning speed. If turning around takes a quarter of a second, the frame rate doesn't matter: you can load 1.3GB raw, up to 5.5GB into memory at full compression, within that 1/4 second. Does Kraken have lossy modes for images that reach 4:1? I wasn't familiar with that format until today...
Cerny did say in the presentation that if it takes a player 0.5 seconds to spin the camera behind them, that's enough time to load in ~4GB worth of textures from the SSD. If that's true it's insane... developers won't have to rely on corridors, level design or other tricks to hide streaming any more. It'll give so much more freedom.
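For anyone who wants to sanity-check those figures, here's a quick back-of-the-envelope script using the publicly quoted PS5 SSD numbers (5.5GB/s raw, roughly 8-9GB/s typical Kraken output, 22GB/s theoretical max); the turn durations are just the hypothetical ones used in the posts above:

[code]
# Back-of-the-envelope PS5 streaming math, using the figures quoted from the
# Road to PS5 talk. Turn durations are the hypothetical ones from this thread.
RAW_GBPS     = 5.5    # raw SSD read speed
TYPICAL_GBPS = 9.0    # rough "typical" Kraken-decompressed output (8-9 quoted)
MAX_GBPS     = 22.0   # theoretical max decompressor output

FRAME_TIME = 1 / 60   # seconds per frame at 60Hz

for label, gbps in [("raw", RAW_GBPS),
                    ("typical compressed", TYPICAL_GBPS),
                    ("max compressed", MAX_GBPS)]:
    print(f"{label}: {gbps * FRAME_TIME * 1000:.0f} MB per 60Hz frame, "
          f"{gbps * 0.25:.2f} GB per 0.25s turn, "
          f"{gbps * 0.5:.2f} GB per 0.5s turn")
[/code]

That lands on ~92MB to ~367MB per frame (roughly where the 352 MB/frame figure above comes from), ~1.4GB to 5.5GB in a quarter-second turn, and ~4.5GB in half a second at typical compression, i.e. the same ballpark as Cerny's ~4GB claim.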
Depends on the performance hit from moving between memory pools on XSX... if you exceed that 10GB and need to use the other 'slower' RAM, you could end up in trouble.
There was a reason Sony completely ditched split memory pools after PS3.
What is faster in the back end? Doesn't almost everything that matters (ROPs, TMUs) scale with CUs?
PS: we have (max) 40 CUs.
> According to the RDNA whitepaper, each "shader engine" has a 64-bit memory interface.

Hm? I don't see such a thing in it: https://www.amd.com/system/files/documents/rdna-whitepaper.pdf
From the RDNA whitepaper:

> The L2 cache is shared across the whole chip and physically partitioned into multiple slices. Four slices of the L2 cache are associated with each 64-bit memory controller to absorb and reduce traffic. The cache is 16-way set-associative and has been enhanced with larger 128-byte cache lines to match the typical wave32 memory request. The slices are flexible and can be configured with 64KB-512KB, depending on the particular product. In the RX 5700 XT, each slice is 256KB and the total capacity is 4MB.
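So the 64-bit granularity in the whitepaper is per memory controller (with its L2 slices), not per shader engine. Purely as an illustration of that passage's arithmetic, and speculatively extending it to a 320-bit part with the same 256KB slices:

[code]
# L2 sizing per the RDNA whitepaper: 4 slices per 64-bit memory controller.
SLICES_PER_64BIT_MC = 4
SLICE_KB = 256  # Navi 10 (RX 5700 XT) slice size; other products may differ

def l2_total_mb(bus_width_bits, slice_kb=SLICE_KB):
    controllers = bus_width_bits // 64
    return controllers * SLICES_PER_64BIT_MC * slice_kb / 1024

print(l2_total_mb(256))  # 4.0 -> matches the RX 5700 XT figure in the quote
print(l2_total_mb(320))  # 5.0 -> what a 320-bit part would get at the same ratio
[/code]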
> The performance hit will just be the decreased bandwidth I think; it's not really split memory pools, just a slower pool.

It's a unified pool on an imbalanced bus. To maximize bandwidth usage it needs all channels to get an equal number of requests over a time slice, to keep the queues in a healthy range. The simple no-brainer method is to spread the address space equally across all chips, which requires identical-size chips. The 10GB is mapped that way. Having an additional 6GB on only some of the channels means accesses to that region stall other requests, as if the entire bus were effectively running at the lower speed. If it's used very lightly, it won't have much impact, if at all. But if there's a lot of throughput to this partition, it will drag the average bandwidth down significantly.
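To put rough numbers on it: the 10GB region is striped across all ten 32-bit channels (560GB/s), while the extra 6GB lives only on the six 2GB chips (336GB/s). A deliberately crude serial model follows; it ignores how the real controller interleaves and overlaps requests, so it only illustrates the trend being described, not actual behaviour:

[code]
# Crude illustration of effective XSX bandwidth when a fraction of the GPU's
# traffic hits the 6GB "standard" region. Assumes the two regions are serviced
# back to back at their peak rates, which is NOT how the controller actually
# arbitrates -- it's only meant to show the trend the post above describes.
FAST_GBPS = 560.0   # 10GB region, striped across all 10 channels
SLOW_GBPS = 336.0   # 6GB region, only on the six 2GB chips

def effective_bandwidth(slow_fraction):
    fast_fraction = 1.0 - slow_fraction
    time = fast_fraction / FAST_GBPS + slow_fraction / SLOW_GBPS
    return 1.0 / time

for f in (0.0, 0.1, 0.3, 0.5):
    print(f"{f:.0%} of traffic in the slow region -> ~{effective_bandwidth(f):.0f} GB/s")
[/code]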
Developers won't split pools for rasterization. That's like purposefully gimping your performance.
> But bandwidth will be the bottleneck here, and PS5 has less of it. So raster performance will likely go to XSX in this case, as it has more.

Bandwidth was a bottleneck on the PS4 Pro's 64 ROPs, and that was using Polaris ROPs. I don't know if an equivalent bandwidth per pixel fillrate is still a bottleneck for RDNA2.
According to the RDNA whitepaper, each "shader engine" has a 64-bit memory interface.
So, to reach XSX's 320-bit bus we would need 5 shader engines, not 4 (the 36-CU PS5, with its 256-bit memory, would be 4).
Which gets us to 80 ROPs.
But it's all speculation.
> PS5 has more bandwidth per ALU.

But less bandwidth per compute throughput, which is a more important metric AFAIK.
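Both statements check out with the announced specs (36 CUs @ 2.23GHz with 448GB/s vs 52 CUs @ 1.825GHz with 560GB/s, 64 ALUs per CU):

[code]
# Bandwidth per CU/ALU vs bandwidth per FLOP, using the announced specs.
def stats(name, cus, clock_ghz, bw_gbps):
    alus = cus * 64                       # 64 stream processors per CU
    tflops = alus * 2 * clock_ghz / 1000  # 2 FLOPs per ALU per clock (FMA)
    print(f"{name}: {bw_gbps / cus:.1f} GB/s per CU, "
          f"{bw_gbps / alus * 1000:.0f} MB/s per ALU, "
          f"{bw_gbps / tflops:.1f} GB/s per TFLOP")

stats("PS5", 36, 2.230, 448)  # ~12.4 GB/s per CU, ~43.6 GB/s per TFLOP
stats("XSX", 52, 1.825, 560)  # ~10.8 GB/s per CU, ~46.1 GB/s per TFLOP
[/code]

So PS5 has more bandwidth per CU/ALU, while XSX has slightly more per unit of compute throughput.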
> If they want to keep pace with PS5 they will have to; PS5 may have less bandwidth, but it has more bandwidth per CU, as well as a single memory pool with a much faster I/O to swap assets in and out of said memory.

The first 10GB is a lot of space just for the framebuffers and textures to begin with. Given multiplatform development, it's unlikely that the majority will push it too hard, since not everyone has >8GB GPUs, so it'll mostly be used for the rather wasteful (uncompressed) stuff.
You can have multiple arrays per engine, up to 16 ROPs per array (per rasterizer). The likely configuration is 2 shader engines with 2 shader arrays each and 7 WGPs per array, so 28 WGPs in total for 56 CUs, with one WGP disabled per shader engine to get 26 enabled WGPs or 52 CUs, therefore 16*2*2 ROPs.
PS5 similarly uses the Navi 10 configuration of 5 WGPs per array: 18 WGPs / 36 CUs, 2 shader engines, 4 shader arrays, so 2*2*16 ROPs.
Fillrates might end up being a wash: despite Anaconda having higher blend-rate bandwidth (read + write), the higher core clock on PS5 will tend to give higher internal bandwidths, and that may come in handy with delta color compression throughput, although who knows, depending on the average compression there.
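The arithmetic behind that speculated layout, for anyone following along (the WGP-per-array counts are this post's guesses, not confirmed specs):

[code]
# CU/ROP arithmetic for the configurations speculated above.
def config(name, shader_engines, arrays_per_se, wgps_per_array,
           wgps_disabled, rops_per_array=16):
    wgps_total = shader_engines * arrays_per_se * wgps_per_array
    wgps_enabled = wgps_total - wgps_disabled
    cus = wgps_enabled * 2                                 # 2 CUs per WGP on RDNA
    rops = shader_engines * arrays_per_se * rops_per_array
    print(f"{name}: {wgps_total} WGPs physical, {wgps_enabled} enabled "
          f"= {cus} CUs, {rops} ROPs")

config("XSX (speculated)", shader_engines=2, arrays_per_se=2,
       wgps_per_array=7, wgps_disabled=2)   # 28 WGPs -> 52 CUs, 64 ROPs
config("PS5 (speculated)", shader_engines=2, arrays_per_se=2,
       wgps_per_array=5, wgps_disabled=2)   # 20 WGPs -> 36 CUs, 64 ROPs
[/code]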
> If they want to keep pace with PS5 they will have to; PS5 may have less bandwidth, but it has more bandwidth per CU, as well as a single memory pool with a much faster I/O to swap assets in and out of said memory.

CUs can pull from cache as well, so I'm not sure that's how you want to do it.
> Bandwidth was a bottleneck on the PS4 Pro's 64 ROPs, and that was using Polaris ROPs. I don't know if an equivalent bandwidth per pixel fillrate is still a bottleneck for RDNA2.

I think the fillrate calculations should be the same though, right? (I'm not sure how to account for compression, to be honest.)
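One rough way to compare is bytes of memory bandwidth per pixel of peak fillrate, ignoring compression and caches entirely, and assuming the 64-ROP counts discussed above for both new consoles:

[code]
# Peak pixel fillrate vs memory bandwidth, ignoring compression and caches.
def fill_vs_bw(name, rops, clock_ghz, bw_gbps):
    gpix_s = rops * clock_ghz   # Gpixels/s peak fillrate
    print(f"{name}: {gpix_s:.1f} Gpix/s, "
          f"{bw_gbps / gpix_s:.2f} bytes of bandwidth per pixel")

fill_vs_bw("PS4 Pro", 64, 0.911, 218)   # the Polaris baseline mentioned above
fill_vs_bw("PS5",     64, 2.230, 448)
fill_vs_bw("XSX",     64, 1.825, 560)
[/code]

By that crude metric PS5 has a bit less bandwidth per pixel of fillrate than the PS4 Pro did (~3.1 vs ~3.7 bytes) and XSX noticeably more (~4.8 bytes); whether that still translates into a ROP bottleneck on RDNA2, with its bigger caches and compression, is exactly the open question above.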
> DOOM 2016 "cleverly re-uses old data computed in the previous frames... 1331 draw calls, 132 textures and 50 render targets," according to a new article which takes a very detailed look at the process of rendering one 16-millisecond frame. An anonymous Slashdot reader writes:
>
> The normal map is stored in a R16G16 float format. The specular map is in R8G8B8A8; the alpha channel contains the smoothness factor.
>
> So DOOM actually cleverly mixes forward and deferred with a hybrid approach. These extra G-Buffers will come in handy when performing additional effects like reflections.

The more render targets, the greater the difference in performance will be. Luckily the difference in bandwidth isn't that large.
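For a sense of scale on those formats: R16G16F and R8G8B8A8 are both 4 bytes per pixel, so just writing those two G-buffer targets once per frame (no depth, no overdraw, no other targets, no compression, i.e. a deliberately naive lower bound) costs roughly:

[code]
# Naive write cost for the two G-buffer targets mentioned above:
# R16G16F normals (4 bytes/px) + R8G8B8A8 specular/smoothness (4 bytes/px).
BYTES_PER_PIXEL = 4 + 4

for name, w, h in [("1080p", 1920, 1080), ("4K", 3840, 2160)]:
    mb_per_frame = w * h * BYTES_PER_PIXEL / 1e6
    gbps_at_60 = mb_per_frame * 60 / 1000
    print(f"{name}: {mb_per_frame:.1f} MB per frame, "
          f"~{gbps_at_60:.1f} GB/s at 60fps (one write pass only)")
[/code]

Real frames re-read those targets for lighting and reflections and pile on many more render targets (the article counts 50 in DOOM), which is where the bandwidth gap between the consoles would actually start to show up.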
It's 36 CUs.
It's 40 CUs (4 disabled for yields) = 36.