Current Generation Hardware Speculation with a Technical Spin [post GDC 2020] [XBSX, PS5]

Discussion in 'Console Technology' started by Proelite, Mar 16, 2020.

Thread Status:
Not open for further replies.
  1. anexanhume

    Veteran

    Joined:
    Dec 5, 2011
    Messages:
    2,078
    Likes Received:
    1,535
    I think NAND used as SLC will have better lifetime too.
     
  2. function

    function None functional
    Legend

    Joined:
    Mar 27, 2003
    Messages:
    5,854
    Likes Received:
    4,400
    Location:
    Wrong thread
Yeah, I think that makes a good case for keeping the two SEs. Perhaps having a setup that could be shared as much as possible between the two was part of their decision-making (assuming Lockhart actually exists!)

    Interesting figures! Are you including dropping the L2 cache along with the memory controllers?

    With so many fewer CUs per SA in the hypothetical setup, might they also be able to halve L2 cache on the remaining controllers (down to 2MB from the 4MB in RDNA1)? That might save them a bit more area still.

I do wonder about memory though. 6 x GDDR6 would be a lot of bandwidth for a budget device around 4TF, even with 8 Zen 2 cores. Although with the CPU and file IO on the XSX only seeing 336GB/s across the whole 16GB, maybe that's a (tenuous) indicator that Lockhart might have the same....

    Back on the subject of CUs per shader array ... I remembered that the PS5 audio solution is a customised CU. Looking at the RDNA 1 whitepaper, their True Audio Next solution is running on CUs partitioned on a shader array, with ACE managed queues. Supports software ray traced audio and hundreds of sounds and everything.

    So ... maybe Sony's specialised CUs for audio are just bunged on the end of a normal shader array. They won't be counted along with the 36 "regular CUs", so that means that ... maybe ... you don't need equal numbers of CUs per shader engine (maybe).
     
    blakjedi and BRiT like this.
  3. TheAlSpark

    TheAlSpark Moderator
    Moderator Legend

    Joined:
    Feb 29, 2004
    Messages:
    22,146
    Likes Received:
    8,533
    Location:
    ಠ_ಠ
    Just the MC portion in the die shot. I was not sure how to identify the GPU L2, so I might be rather conservative unless the cache is in the MC. (I just drew a rectangle around the MCs)

    Probably better to just keep the per-slice L2 the same since that will have some effect on external bandwidth pressure + power consumption related to off-die requests.

    Maybe they could get away with a cheaper bin (e.g. 12Gbps). ;)
     
    DSoup, function and BRiT like this.
  4. blakjedi

    Veteran

    Joined:
    Nov 20, 2004
    Messages:
    2,985
    Likes Received:
    88
    Location:
    20001
    I found this an interesting part of the TweakTown article regarding the controller speed: "Based on this, the Xbox Series X's SSD can come in up to 2TB capacities, and theoretically deliver up to 3.75GB/sec sequential reads and writes..." I take it that that is raw speed and not compressed?

    If so, why does MS advertise 2.4GB/s raw and 4.8GB/s compressed instead? Is that why the HW decompressor chip is rated at 6GB/s, well above MS's listed speeds?

    Very confusing unless the difference is for overhead.
     
  5. iroboto

    iroboto Daft Funk
    Legend Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    14,833
    Likes Received:
    18,632
    Location:
    The North
    Because the speed described is guaranteed bandwidth, meaning that's what to expect under all load conditions (heat). MS never gave out optimal speeds; we just assumed they were the same.

    I'm also curious whether this implies random read speed, which would be too fast.
     
  6. MrFox

    MrFox Deludedly Fantastic
    Legend

    Joined:
    Jan 7, 2012
    Messages:
    6,488
    Likes Received:
    5,996
    To get 3750 they need to buy the 1200 MT NAND chips.

    With that specific single-core controller, the overhead from signalling, ECC, etc. gives 3750 (out of a 4800 NAND bus) available to the host on the other side of the controller.

    So they would be using 800 MT parts, which add up to 2500 after removing the same overhead, making 2400 "guaranteed" reasonable. These are widespread and much less expensive than the cream of the crop.

    Sony must be using 533 MT or 667 MT parts, saving more money on the NAND but spending more on the controller.
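    That arithmetic can be sketched in a few lines. To be clear, the efficiency factor below is simply back-derived from the 3750-out-of-4800 figure quoted above, and the 4-channel count is an assumption for illustration, not a confirmed spec:

    ```python
    # Napkin-math sketch of the NAND bus arithmetic above. The efficiency
    # factor is back-derived from the quoted 3750-out-of-4800 figure, and
    # the 4-channel count is an assumption, not a confirmed spec.
    def host_bandwidth_mbps(mt_per_chip, channels=4, efficiency=3750 / 4800):
        raw_bus = mt_per_chip * channels   # raw NAND bus rate in MB/s
        return raw_bus * efficiency        # what's left after signalling/ECC overhead

    print(round(host_bandwidth_mbps(1200)))  # 3750 -- needs top-bin parts
    print(round(host_bandwidth_mbps(800)))   # 2500 -- cheaper, widespread parts
    ```

    Same overhead ratio either way; the bin of the NAND parts is what moves the host-side number.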
     
    chris1515, Pete, Ika and 6 others like this.
  7. blakjedi

    Veteran

    Joined:
    Nov 20, 2004
    Messages:
    2,985
    Likes Received:
    88
    Location:
    20001
    And then use 12 channels compared to 4 to reach the speeds touted in their solution?
     
  8. MrFox

    MrFox Deludedly Fantastic
    Legend

    Joined:
    Jan 7, 2012
    Messages:
    6,488
    Likes Received:
    5,996
    Yep, the bandwidth they wanted seems to be the foundation of the entire design. The flash parts, the custom controller, and the decompression block in the SoC.
     
    blakjedi likes this.
  9. RagnarokFF

    Newcomer

    Joined:
    Mar 22, 2020
    Messages:
    57
    Likes Received:
    146
    Won't happen. The OS needs RAM for background tasks, and you want to reduce writes to the SSD.

    MS chose the RAM setup because developers like such a trade-off when it gives them more bandwidth. Goossen talked about this in Digital Foundry's Inside Xbox Series X article.
     
    PSman1700 likes this.
  10. Jay

    Jay
    Veteran

    Joined:
    Aug 3, 2013
    Messages:
    4,029
    Likes Received:
    3,428
    I think that a lot of people believe it will be handled by the developers, hence the concerns.
    They see all this low-level discussion when it will all sit behind the MMU and OS.
    So all devs need to do is put an allocation in either the slow or fast section of memory.
    Even if MS allows devs to implement their own memory management, that would just be cutting out part of the OS; the MMU will still expose it as a chunk of memory addresses.

    Even the slow memory is actually relatively fast. What is it, 30GB/s faster than the 1X? They could even use it for graphics work, just not for intermediate render targets and the like.
    Fast enough for textures that are read by the GPU at the start of the frame? Although I don't see any reason you'd need to do that, just making a point.
     
    milk and VitaminB6 like this.
  11. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,661
    Likes Received:
    1,114
    Very unlikely, IMO. Those will cost a premium, and consoles are very cost sensitive.

    I would be surprised if the MS and Sony solutions don't both use 8 channels; it doubles the number of IOPS your storage device can handle, which will be crucial to how the devices are used.

    The 2.4GB/s figure might be a limit of the decompression block; 4.8GB/s decompressed is quite a lot. In both cases there is plenty of bandwidth.

    Cheers
     
  12. tunafish

    Regular

    Joined:
    Aug 19, 2011
    Messages:
    627
    Likes Received:
    414
    The channel counts for both flash controller chips are known. Sony uses a custom 12-channel design, while Microsoft uses a PS5019-E19T.
     
  13. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,661
    Likes Received:
    1,114
    Alright, surprised by this.

    Cheers
     
  14. fehu

    Veteran

    Joined:
    Nov 15, 2006
    Messages:
    2,067
    Likes Received:
    992
    Location:
    Somewhere over the ocean
    What does "CE # Max: 16" mean?
     
  15. BRiT

    BRiT (>• •)>⌐■-■ (⌐■-■)
    Moderator Legend Alpha

    Joined:
    Feb 7, 2002
    Messages:
    20,502
    Likes Received:
    24,397
    That's assuming the linked info is correct. Has it actually been seen in teardowns?
     
  16. MrFox

    MrFox Deludedly Fantastic
    Legend

    Joined:
    Jan 7, 2012
    Messages:
    6,488
    Likes Received:
    5,996
    Chip Enable lines. They allow putting more chips on the same channels to grow capacity, but only one chip per channel can be enabled at a time, exactly like using more DIMMs on a PC above the physical channel count. But here it's limited to 2TB total.

    The 1.6W controller barely needs a heatsink; the simple conducting tabs to the case are starting to make sense.
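    As a rough illustration of how CE lines scale capacity (the channel count and die size below are hypothetical figures, not a confirmed configuration; only the 2TB ceiling comes from the spec discussed above):

    ```python
    # Hypothetical illustration of chip-enable (CE) scaling: more dies per
    # channel grow capacity, but only one die per channel is active at a
    # time. Channel count and die size are assumptions for illustration;
    # the 2TB (2048GB) cap is the controller limit discussed above.
    def max_capacity_gb(channels, ce_per_channel, die_gb, cap_gb=2048):
        # Total addressable flash, clamped to the controller's 2TB limit.
        return min(channels * ce_per_channel * die_gb, cap_gb)

    print(max_capacity_gb(4, 16, 64))  # 2048 -> hits the 2TB ceiling
    print(max_capacity_gb(4, 4, 64))   # 1024 -> a 1TB configuration
    ```

    More CE lines buy capacity headroom, not bandwidth, since the channel count stays the same.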
     
    #1976 MrFox, Apr 21, 2020
    Last edited: Apr 21, 2020
    TheAlSpark and fehu like this.
  17. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    The architect for the Series X gave a figure of >6 GB/s throughput for the decompression block, though the decision not to use that as the official number seems to indicate it's not common.
    https://www.eurogamer.net/articles/digitalfoundry-2020-inside-xbox-series-x-full-specs
     
    BRiT, PSman1700, tinokun and 2 others like this.
  18. zupallinere

    Regular Subscriber

    Joined:
    Sep 8, 2006
    Messages:
    768
    Likes Received:
    109
    From the piece:
    That is pretty neat.
     
    PSman1700 likes this.
  19. MrFox

    MrFox Deludedly Fantastic
    Legend

    Joined:
    Jan 7, 2012
    Messages:
    6,488
    Likes Received:
    5,996
    Within a dataset the compression ratio will vary a lot around the average cited as "typical". I think that's the reason for mentioning the peak throughput: the 1:1 incompressible data gets averaged with 4:1 things like geometry, masks, alphas, or lossy-optimised BCn textures.

    Some examples with an imaginary dataset spread as 25% at 1:1, 50% at 2:1, and 25% at 4:1....
    WARNING: May contain traces of Cheap Napkin Math (tm)

    Reading 100MB uncompressed
    25% : 1:1 @ 5.5GB/s (4.54ms)
    50% : 2:1 @ 11GB/s (4.54ms)
    25% : 4:1 @ 22GB/s (1.14ms)
    Total 1.77x compression average
    56.25MB on disk
    100MB/10.22ms = 9.78GB/s

    Reading 100MB uncompressed
    25% : 1:1 @ 2.4GB/s (10.41ms)
    50% : 2:1 @ 4.8GB/s (10.41ms)
    25% : 4:1 @ 6GB/s (4.16ms)
    Total 1.77x compression average
    56.25MB on disk
    100MB/24.98ms = 4.00GB/s

    And let's suppose there's a 10% better lossless compression on XBSX reaching a 2x average....

    Reading 100MB uncompressed
    25% : 1.1:1 @ 2.64GB/s (9.47ms)
    50% : 2.2:1 @ 5.28GB/s (9.47ms)
    25% : 4.4:1 @ 6GB/s (4.16ms)
    Total 1.96x compression average
    51.13MB on disk
    100MB/23.1ms = 4.33GB/s
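    For what it's worth, the napkin math above can be reproduced with a small helper (same imaginary dataset mix as above; the function and its name are mine):

    ```python
    # Reproduces the napkin math above: effective read throughput for a mix
    # of compression ratios, each capped at a given decompressor output rate.
    # The dataset mix and rate caps are the same imaginary figures as above.
    def effective_throughput(total_mb, mix):
        """mix: list of (fraction, ratio, cap_gbps) tuples.
        Returns (avg_ratio, on_disk_mb, effective_gbps)."""
        time_ms = 0.0
        on_disk_mb = 0.0
        for frac, ratio, cap_gbps in mix:
            out_mb = total_mb * frac        # uncompressed bytes delivered
            time_ms += out_mb / cap_gbps    # MB divided by GB/s gives ms
            on_disk_mb += out_mb / ratio    # bytes actually read from flash
        return total_mb / on_disk_mb, on_disk_mb, total_mb / time_ms

    # First scenario: 5.5GB/s raw, output scaling freely with the ratio.
    print(effective_throughput(100, [(0.25, 1, 5.5), (0.50, 2, 11), (0.25, 4, 22)]))
    # Second scenario: 2.4GB/s raw with a 6GB/s output ceiling.
    print(effective_throughput(100, [(0.25, 1, 2.4), (0.50, 2, 4.8), (0.25, 4, 6.0)]))
    ```

    The second scenario shows where the 6GB/s ceiling bites: the 4:1 slice reads at 6GB/s instead of the 9.6GB/s the raw rate would otherwise imply.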

    Here the problem is that the more they try to raise the compression ratio with BCn optimisers, or BCPack, or RDO, the more useless it becomes if the output is limited to a 6GB/s rate. That doesn't seem to be the case on Sony's platform: they can use RDO to crank up "lossy" repacking and get closer and closer to 22GB/s effective.

    Caveat: What Cerny called "data that compresses particularly well" might not necessarily mean anything that compresses at 4:1; it could be about the amount of processing required to decompress specific data, which may or may not have a linear relationship with the data in/out bandwidths. There might be iterative or recursive operations in the decompression algorithm that vary based on the data. I also have no idea how that works for ASIC implementations, but it seems to matter on CPU.

    Caveat 2: We don't know exactly what the 6GB/s peak represents on XBSX either.

    Caveat 3: Sony have a 3x advantage in IOPS in addition to the bandwidth; block sizes might be allowed to be smaller.
     
    megre, Mitchings and BRiT like this.
  20. tinokun

    Newcomer Subscriber

    Joined:
    Jul 23, 2004
    Messages:
    70
    Likes Received:
    87
    Location:
    Peru
    Nitpick: remember that the peak is "over 6GB/s", not exactly 6.
     
    PSman1700 likes this.