Current Generation Hardware Speculation with a Technical Spin [post GDC 2020] [XBSX, PS5]

Honestly we have no clue what the real world impact is going to be for their alterations. He was singling out the cache scrubbers
That's one of the reasons I just use the figures provided: there are things that may make a tangible difference, but what they are we don't know.
But then you get into questions like how much difference SFS makes, etc.

Now an MS employee is saying that 4.8 was conservative. What does that mean, 4.81? 5.5? So I'll just stick with 4.8 until we know what's what.
I just think adding an "x" that could mean anything muddies the water a lot, because it could be negligible or it could be significant.
 
No, he didn't give numbers; it's not really possible at this stage. Honestly we have no clue what the real-world impact of their alterations is going to be. He was singling out the cache scrubbers as a Sony-exclusive feature that AMD wouldn't use on PC ("developed just for us"). And he said that during heavy streaming it prevents continuously flushing the entire GPU cache just because the storage engine DMAed something. I don't know how the GPU cache works, so I have no idea if it's a big deal or not. Until now I had no idea it needed to be flushed just because another module decides to write something into RAM. I'm assuming flushing the GPU caches is a big inefficiency hit.

So does this mean data can be DMA'd directly from storage into the GPU cache, bypassing main memory?
 
How so? You have RT units capable of performing however many intersect tests per clock. 1 unit at 10 GHz should be able to process the same number as 10 units at 1 GHz, I'd have thought. How does parallelism help with AMD's implementation of RT in RDNA2?

hm... indeed. I'm not sure I understand the (tangible) differences that may arise from sending out work to multiple units at a slower rate vs fewer units at a higher rate.

Let's say that for a given workload of ten credits, a single extreme unit completes 1 credit in 1/10th of the time. What does that mean for memory transactions at the LLC (i.e. GDDR6)? Would that be 10 memory transactions? Conversely, when distributing the 10 credits across ten units each running at 1/10th the speed, would they complete the workload in the same time with a single memory transaction?
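
As a toy model of the first part (illustrative numbers only, not any vendor's real figures), the peak test rate is the same either way; what differs is how the work, and its memory traffic, is spread out in time:

[CODE=python]
# Toy model: "few fast" vs "many slow" intersection units.
# Illustrative numbers only; not actual RDNA2 or Turing figures.
def peak_rate(units, tests_per_clock, clock_ghz):
    """Billions of intersection tests per second."""
    return units * tests_per_clock * clock_ghz

few_fast  = peak_rate(units=1,  tests_per_clock=1, clock_ghz=10.0)
many_slow = peak_rate(units=10, tests_per_clock=1, clock_ghz=1.0)
print(few_fast, many_slow)   # 10.0 10.0: identical peak throughput

# The difference is in time: 10 slow units issue 10 memory requests in the same
# clock (burstier LLC traffic), while the single fast unit issues them serially,
# so latency hiding and cache behaviour are where the two really diverge.
[/CODE]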

----

Anyway, MS gave the number of intersections, which corresponds to the texel rate, so I assume it would mean something similar for PS5.

I don't really understand how nVidia goes about its RT cores vs rays calculation either. e.g. 72 RT cores * 1.455GHz = 104.76 Giggidies / sec, but the Ray Rate is 10 GigaRays/sec. There seems to be a similar factor of 10 difference in other GPUs.

1 "RT Core" per DCU (WGP)? That would maybe mean (1.825*26/10) ~4.7B rays per sec on the SX, which is close to what 2060 performs.

¯\_(ツ)_/¯

Minecraft RT -> 1080p60 (4k DLSS) on 2080Ti. 1080p30-60 on SX (no DLSS). :unsure::oops::confused::???:
 
Nope. Those are still not accurate specs; they use truncation on one side and rounding up on the other.

The first thing I see is that the TF numbers for both sides are wrong.
Series X is 12.155 TF, not 12.
PS5 is 10.28 TF max, not 10.3.
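
For reference, here's where those figures come from (assuming the announced 52 CUs at 1.825 GHz and 36 CUs at a 2.23 GHz max boost; the last digit depends on how you round the clocks):

[CODE=python]
# FP32 rate = CUs * 64 lanes * 2 FLOPs per FMA * clock (GHz), in TFLOPS.
def tflops(cus, clock_ghz):
    return cus * 64 * 2 * clock_ghz / 1000

print(tflops(52, 1.825))   # ~12.15 TF, Series X (fixed clock)
print(tflops(36, 2.23))    # ~10.28 TF, PS5 (at its maximum clock)
[/CODE]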
 
I would assume it always needs to be written to RAM.

I'm trying to work out why the cache scrubbers add particular benefit now, and why this is only being done now, given that the speed of the data feed from main memory to cache hasn't changed just because the system has a faster SSD.

My assumption is that the data in cache is changing much more frequently because the data in VRAM is also changing more frequently. But surely that's always been the case as graphics data has grown in size, and the way that's been dealt with in the past is through larger caches.

So are we saying cache scrubbers are now needed because cache sizes are disproportionately small compared with the amount of data that's being used per frame as enabled by the new storage designs?

Are cache scrubbers essentially a hack because RDNA2 hasn't been designed to cope with the amount of data per frame that the PS5 will allow as a result of the huge jump in streaming speed?
 
I'm trying to work out why the cache scrubbers add particular benefit now, and why this is only being done now, given that the speed of the data feed from main memory to cache hasn't changed just because the system has a faster SSD.

Maybe as much as the ID buffer from the Pro.
 
They haven't really explained how the access works, have they? I'm pretty sure it's something more elegant than the 970, where using the last 0.5 GB gets only a fraction of the memory bandwidth.

This is not important. This is the memory reserved for the CPU, but you still need to access it, and while you're accessing it the bus is only 192 bits wide. That decreases the overall bandwidth because it is a unified bus.
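
As a toy illustration of that effect (the split of bus time below is made up purely to show the shape of the trade-off on a shared bus, not a measured workload):

[CODE=python]
# Toy model of a unified bus: while the controller serves the 192-bit region the
# whole bus runs at 336 GB/s, the rest of the time at 560 GB/s.
# The 10% figure is made up for illustration.
slow_fraction = 0.10
effective = (1 - slow_fraction) * 560 + slow_fraction * 336
print(effective)   # 537.6 GB/s aggregate in this made-up case
[/CODE]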
 
They haven't really explained how the access works, have they? I'm pretty sure it's something more elegant than the 970, where using the last 0.5 GB gets only a fraction of the memory bandwidth.

There's nothing else to tell. Once they gave the two bandwidth levels, you can work out how the RAM is laid out.

There are 10 GDDR6 chips with a 32-bit bus each (or rather 2×16-bit channels given GDDR6's design, but AFAIK you can't really split those).
6 of those chips have a capacity of 16 Gbit / 2 GByte (let's call them Group A), and 4 of those chips have a capacity of 8 Gbit / 1 GByte (Group B).
Data is always interleaved across all chips to maximize bandwidth, though you can only do that while all chips have free capacity to receive data (usually that's always possible because all the chips have the same capacity).

This means you get a 10×32-bit = 320-bit bus (560 GB/s) while the memory controller is using Group B and the first half (1 GByte) of Group A.
Once Group B is full (those chips only have 1 GByte each), you can only use the second half of Group A, which is 6×32-bit = 192-bit (336 GB/s).
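
The bandwidth figures fall straight out of that layout. As a quick sanity check (assuming 14 Gbps GDDR6, which is what the quoted 560/336 GB/s imply):

[CODE=python]
# Sanity check of the two pools, assuming 14 Gbps GDDR6 on 32-bit chips.
GBIT_PER_PIN = 14                            # Gbit/s per pin (assumed speed grade)
PINS_PER_CHIP = 32

chip_bw = GBIT_PER_PIN * PINS_PER_CHIP / 8   # 56 GB/s per chip

fast_pool = 10 * chip_bw   # all 10 chips interleaved: first 10 GB at 560 GB/s
slow_pool = 6 * chip_bw    # only the six 2 GB chips:  upper 6 GB at 336 GB/s
print(fast_pool, slow_pool)  # 560.0 336.0
[/CODE]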


It's exactly like nvidia did with the 660 Ti and the 550 Ti before it, only with different proportions.


EDIT: The 970 is different because nvidia simply broke interleaving to one 32-bit channel by cutting off its L2. It was essentially a 224-bit GPU with one extra 32-bit, 512 MB chip that worked very slowly.
 
I'm trying to work out why the cache scrubbers add particular benefit now, and why this is only being done now, given that the speed of the data feed from main memory to cache hasn't changed just because the system has a faster SSD.

My assumption is that the data in cache is changing much more frequently because the data in VRAM is also changing more frequently. But surely that's always been the case as graphics data has grown in size, and the way that's been dealt with in the past is through larger caches.

So are we saying cache scrubbers are now needed because cache sizes are disproportionately small compared with the amount of data that's being used per frame as enabled by the new storage designs?

Are cache scrubbers essentially a hack because RDNA2 hasn't been designed to cope with the amount of data per frame that the PS5 will allow as a result of the huge jump in streaming speed?
Cerny's presentation, and some of the past presentations on Sony's compute goals hint at a sensitivity to latency. Latency helped defeat the GPU and DSP's general use in most audio for the PS4, and now there is Tempest. Cerny gave as part of his justification for the high-clock strategy scenarios where the GPU could not fully utilize its width, but could complete smaller tasks faster if the clock speed was raised.

If there is a memory range that may exist in the GPU caches that gets overwritten by a read from the SSD, the old copies in the GPU do not automatically update. RDNA2 is not unique in this, as in almost all situations the GPU cache hierarchies are weakly ordered and slow to propagate changes. In fairness, most data read freshly from IO need additional work to keep consistent even for CPUs.
If you don't want the GPU to be using the wrong data, the data in the GPU needs to be cleared out of the caches before a shader tries to read from those addresses. The PS4's volatile flag was a different cache invalidation optimization, so there does seem to be a history of such tweaks in the Cerny era.
The general cache invalidation process for the GCN/RDNA caches is a long-latency event. It's a pipeline event that blocks most of the graphics pipeline (command processor, CUs, wavefront launch, graphics blocks) until the invalidation process runs its course. This also comes up when CUs read from render targets in GCN, particularly after DCC was introduced and prior to the ROPs becoming L2 clients with Vega. The cache flush events are expensive and advised against heavily.

In the past, an HDD's limited parallelism and long seek times would have eclipsed this process and kept it at a lower frequency.
If the PS5's design expects to be able to fire off many more accesses and use them in a relatively aggressive time frame, then the scrubbers may reduce the impact by potentially reducing the cost of such operations, or reducing the number of full stalls that need to happen.
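
A very rough sketch of the distinction being described here, as a toy model (the real scrubbers are hardware in the cache controllers, not software; "scrub" below just means invalidating only the lines whose backing addresses the SSD DMA overwrote, instead of dumping everything):

[CODE=python]
# Toy contrast: full GPU cache invalidation vs targeted "scrubbing".
# Purely illustrative; the real mechanism is hardware, not a Python loop.
class ToyCache:
    def __init__(self, lines):
        self.lines = dict(lines)       # address tag -> cached data

    def full_invalidate(self):
        # Expensive path: everything goes; on real hardware this is a pipeline
        # event that stalls work until the invalidation completes.
        self.lines.clear()

    def scrub(self, dma_start, dma_end):
        # Cheaper path: drop only lines whose addresses were overwritten by the
        # DMA from storage, leaving the rest of the cache warm.
        for tag in list(self.lines):
            if dma_start <= tag < dma_end:
                del self.lines[tag]

cache = ToyCache({0x0000: "old texture", 0x1000: "geometry", 0x2000: "constants"})
cache.scrub(0x0000, 0x1000)            # SSD DMA overwrote [0x0000, 0x1000)
print(cache.lines)                     # geometry and constants stay resident
[/CODE]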
 
To be pedantic, none of the memory is reserved for CPU-only or GPU-only use, but it would be wise to have the GPU use the 560 GB/s region and the CPU use the 336 GB/s region. The entire memory is fully accessible by either.
 
MS's strategy of having two different speeds of RAM at once is weirding me out. Can they both be used by the same tasks at once?
 
Yes. The only difference is the bandwidth.

That's what confuses me. So is the bandwidth 560+336 in these instances, or...? I would have thought MS would have mentioned being able to add them together, but again I don't know anything.
 
MS's strategy of having two different speeds of RAM at once is weirding me out. Can they both be used by the same tasks at once?
It's two chunks of memory space, not two different speeds of RAM. Data that fits into the first 10GB can be striped across 10 chips, ergo up to 560GB/s. For data that is put into the upper 1GB addresses of the 2GB chips, there can naturally only be striping across 6 chips, hence 336GB/s access.
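
A rough sketch of that striping idea (the 256-byte stripe below is just an assumed granularity for illustration; the real interleave size isn't public):

[CODE=python]
# Illustrative address-to-chip striping for the two regions of the Series X pool.
# The 256-byte stripe size is an assumption; the real granularity isn't public.
STRIPE = 256
FAST_POOL_END = 10 * 2**30            # first 10 GB is striped across all 10 chips

def chip_for(addr):
    if addr < FAST_POOL_END:
        return (addr // STRIPE) % 10  # 10-way interleave: up to 560 GB/s
    return (addr // STRIPE) % 6       # upper 6 GB, 6-way interleave: 336 GB/s

print(chip_for(0x1000))           # an address in the fast region
print(chip_for(FAST_POOL_END))    # an address in the upper 6 GB
[/CODE]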
 
It's two chunks of memory space, not two different speeds of RAM. Data that fits into the first 10GB can be striped across 10 chips, ergo up to 560GB/s. For data that is put into the upper 1GB addresses of the 2GB chips, there can naturally only be striping across 6 chips, hence 336GB/s access.

I mean, of course, everyone knows about that; it's first-grade stuff...

-cough-
 