Current Generation Hardware Speculation with a Technical Spin [post GDC 2020] [XBSX, PS5]

It's exactly like nvidia did with the 660 Ti and the 550 Ti before it, only with different proportions.
I think these types of comparisons need context.

Even if the implementation were the same, the use cases are different, and that needs to be mentioned when making such comparisons.

On discrete graphics cards the memory is only being accessed by the GPU, and it's not managed by the developer.
When you take a single pool with the same bandwidth as any other graphics card and put it in a console, you have to deal with contention from the CPU, so the effective bandwidth isn't the same.

General comment:
The question is whether, when you have a single bus in a console and there is CPU contention, the effective bandwidth ends up much better than how the XSX is set up when it has contention. If the contention works out the same, then the split bandwidth has no effective negative impact, even if the bandwidth is reduced when accessed by the CPU.
 
That's what confuses me. So the bandwidth is 560+336 in these instances, or...? I would have thought MS would have mentioned being able to add them together, but again I don't know anything.
There are two physical buses (one 320-bit and one 192-bit wide) to the two GDDR6 RAM pools but only one memory controller, so you don't add the 560GB/s and 336GB/s bandwidths together. If Microsoft had wanted both RAM pools operating in parallel they'd have needed a second memory controller, a higher-level arbitrating controller, and they would have needed to come up with a cache coherency solution for reads/writes across the two memory controllers.
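To put numbers on that: both figures fall straight out of bus width times the GDDR6 data rate (14Gbps per pin is the widely reported speed for the Series X modules, so treat that as an assumption). A quick sketch:

```python
# Back-of-the-envelope: where the Series X bandwidth figures come from.
# Assumes 14 Gbps GDDR6 (widely reported for Series X); bus widths as above.

GBPS_PER_PIN = 14  # GDDR6 data rate per pin, in Gbit/s (assumed)

def bandwidth_gb_s(bus_width_bits: int) -> float:
    """Peak bandwidth in GB/s for a given bus width at 14 Gbps per pin."""
    return bus_width_bits * GBPS_PER_PIN / 8

print(bandwidth_gb_s(320))  # fast 10 GB region, all ten chips  -> 560.0 GB/s
print(bandwidth_gb_s(192))  # slow 6 GB region, six chips only  -> 336.0 GB/s

# Both figures describe the same physical interface: the 192-bit case is just
# the subset of chips the slow region lives on, so the two peaks don't add.
```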
 

So then how is it that they can both be used together for the same tasks at the same time? Isn't that the same thing? Or no?
 
To clarify, the memory ops are interleaved, "you first, my good GPU. Why thank you, Mr. CPU. Here, you have a turn now. Most gracious my kind sir, I do not want to hog, please, access some RAM..." and not in parallel. 'At the same time' means working on jobs at the same time but sharing the RAM IO so accesses aren't concurrent.
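If it helps, here's a toy round-robin arbiter (entirely my own sketch; the real arbitration policy in either console isn't public) showing what "interleaved, not parallel" means: both clients keep queues of outstanding requests, but each bus slot serves exactly one of them.

```python
from collections import deque

# Toy model of one shared memory interface: a single arbiter, two request
# queues. The round-robin policy here is hypothetical.

def arbitrate(cpu_reqs, gpu_reqs, slots):
    cpu, gpu = deque(cpu_reqs), deque(gpu_reqs)
    order = []
    for slot in range(slots):
        prefer_cpu = slot % 2 == 0          # alternate priority each slot
        if cpu and (prefer_cpu or not gpu):
            order.append((slot, "CPU", cpu.popleft()))
        elif gpu:
            order.append((slot, "GPU", gpu.popleft()))
    return order

for slot, client, req in arbitrate(["c0", "c1"], ["g0", "g1", "g2"], 6):
    print(slot, client, req)
# Each slot serves exactly one client: the CPU and GPU run their jobs
# concurrently, but their RAM accesses are interleaved, not simultaneous.
```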
 
Just to be my pedantic self: there aren't two physical buses, there are 20 physical buses, each 16 bits wide. And then there are either 20 (16-bit), 10 (32-bit split into 2 channels) or 5 (64-bit split into 4 channels) memory controllers (I'm still not really sure on these; AMD in some slides splits them as 16-bit controllers, in others as 64-bit, for the same chip).
If you're accessing the 10 GB portion, you can stripe it across all ten chips and get that 560GB/s; the 6 GB portion is striped across six chips only, leaving you with 336GB/s for that portion.
 
But aren't the memory controllers split into 32-bit/16-bit "chunks"? So you could still access fast memory and slow memory at the same time, because you are not always using the full interface.

No. You don't access them on the same cycle, as anything stored in the fast region (the first 1 GB of each chip) has its data striped across all 10 chips.

A is located in the 6 GB of slow memory
B is located in the 10 GB of fast memory
C is located in the 10 GB of fast memory

AAAAAA => contains the word "SERIES"
BBBBBBBBBB => contains the word "TELEVISION"
CCCCCCCCCC => contains the word "BASKETBALL"

You either read/write AAAAAA or BBBBBBBBBB on one cycle, yielding these complete words;

you don't read/write AAAAAABBBB / AAAAAACCCC in one cycle, because you would get fragmented data that reads either SERIESSION or SERIESBALL.
Sure, you could drop the last four letters and still get "SERIES", but that brings you back to square one.
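Here's a minimal sketch of that striping using the same made-up words (the chip split, six chips for the slow region and ten for the fast one, is the publicly described layout; everything else, names included, is just illustration):

```python
# Toy model of striping data across the ten Series X GDDR6 chips.
# Hypothetical layout: slow-region data occupies chips 0-5, fast-region data
# is striped across chips 0-9. One "burst" reads one element from every chip.

NUM_CHIPS = 10
SLOW_CHIPS = 6

def stripe(word: str, chips: int) -> list:
    """Place one character per chip, leaving unused chips as None."""
    lanes = [None] * NUM_CHIPS
    for i, ch in enumerate(word[:chips]):
        lanes[i] = ch
    return lanes

A = stripe("SERIES", SLOW_CHIPS)       # slow 6 GB region
B = stripe("TELEVISION", NUM_CHIPS)    # fast 10 GB region
C = stripe("BASKETBALL", NUM_CHIPS)    # fast 10 GB region

def burst(lanes: list) -> str:
    """One full-width access: every chip returns its lane for this allocation."""
    return "".join(ch for ch in lanes if ch is not None)

print(burst(A))  # SERIES      -> only 6 lanes carry useful data (336 GB/s worth)
print(burst(B))  # TELEVISION  -> all 10 lanes carry useful data (560 GB/s worth)

# Mixing lanes from two allocations in one burst only yields fragments:
print(burst(A[:6] + B[6:]))  # SERIESSION -- not a useful read for either job
```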
 
Cerny's presentation, and some of Sony's past presentations on their compute goals, hint at a sensitivity to latency. Latency helped defeat general use of the GPU and the DSP for most audio on the PS4, and now there is Tempest. As part of his justification for the high-clock strategy, Cerny gave scenarios where the GPU could not fully utilize its width but could complete smaller tasks faster if the clock speed was raised.

If there is a memory range that may exist in the GPU caches that gets overwritten by a read from the SSD, the old copies in the GPU do not automatically update. RDNA2 is not unique in this, as in almost all situations the GPU cache hierarchies are weakly ordered and slow to propagate changes. In fairness, most data read freshly from IO need additional work to keep consistent even for CPUs.
If you don't want the GPU to be using the wrong data, the data in the GPU needs to be cleared out of the caches before a shader tries to read from those addresses. The PS4's volatile flag was a different cache invalidation optimization, so there does seem to be a history of such tweaks in the Cerny era.
The general cache invalidation process for the GCN/RDNA caches is a long-latency event. It's a pipeline event that blocks most of the graphics pipeline (command processor, CUs, wavefront launch, graphics blocks) until the invalidation process runs its course. This also comes up when CUs read from render targets in GCN, particularly after DCC was introduced and prior to the ROPs becoming L2 clients with Vega. The cache flush events are expensive and advised against heavily.

In the past, an HDD's limited parallelism and long seek times would have eclipsed this process and kept it at a lower frequency.
If the PS5's design expects to be able to fire off many more accesses and use them in a relatively aggressive time frame, then the scrubbers may reduce the impact by potentially reducing the cost of such operations, or reducing the number of full stalls that need to happen.
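To make that concrete, here's a toy cache model (my own sketch; the actual RDNA invalidation mechanism and the PS5 scrubbers aren't documented at this level) contrasting a full invalidation, which throws away every line, with a targeted scrub that only drops lines in the address range the SSD just rewrote:

```python
# Toy cache contrasting a full flush with a targeted "scrub".
# Purely illustrative; real RDNA caches and the PS5 scrubbers are not
# publicly documented at this level.

LINE_SIZE = 64  # bytes per cache line (typical, assumed)

class ToyCache:
    def __init__(self):
        self.lines = {}  # line_address -> data

    def fill(self, addr, data):
        self.lines[addr & ~(LINE_SIZE - 1)] = data

    def full_invalidate(self):
        """Pipeline-visible event: every line is dropped, hot or not."""
        dropped = len(self.lines)
        self.lines.clear()
        return dropped

    def scrub(self, start, end):
        """Drop only lines overlapping [start, end), the range the SSD rewrote."""
        stale = [a for a in self.lines if start <= a < end]
        for a in stale:
            del self.lines[a]
        return len(stale)

cache = ToyCache()
for addr in range(0, 64 * 1024, LINE_SIZE):   # 1024 cached lines
    cache.fill(addr, b"old")

# The SSD overwrites a 4 KB region of RAM; only 64 of the 1024 lines are stale.
print(cache.scrub(0x2000, 0x3000))   # 64 lines dropped, the rest stay warm
print(cache.full_invalidate())       # vs. throwing away all remaining 960 lines
```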

Great, thanks, I think I understand that now. So GPUs without some form of cache invalidation optimization are going to suffer a penalty in terms of increased cache flushes when used in concert with very fast storage solutions, which will presumably become far more prevalent next generation. So essentially the cache scrubbers are a GPU performance aid as opposed to something that actually speeds up the IO, from the sounds of it?

So that raises a couple of questions:

  1. Does XSX have something similar, or is its GPU going to suffer relatively compared with the PS5 in this regard? Or indeed, is its storage system not fast enough for this to matter that much?
  2. Cache scrubbers sound like they'd be sensible in the PC space too, unless the programming model on the PC makes this impractical. Assuming it doesn't, I wonder if this is one of the enhancements in PS5 that we might see in the future AMD GPUs that Cerny mentioned. If not RDNA2, then perhaps RDNA3 (which may better match the timescale of very fast IO solutions becoming prevalent in the PC space).
 
I don't really understand that, though, as it only matters if the data changes while it's in the caches, and how likely is that? Like, you've got a load of geometry and textures present drawing some scenery, and then a character. New scenery is loaded. Now, to draw that scenery, the caches are full of character info, so they'd naturally reload the scenery data with the latest copy in RAM.

It's only an issue if the GPU is drawing scenery, the scenery data is cached, and new scenery data is loaded. That seems a rare occurrence, that the caches stick with the same data.
 

It'll surely impact any non-graphics compute job where large amounts of data are pulled from the SSD. This could be locale/event-specific physics, animation or anything which the GPU is better at than the CPU.
 
True, but it's still only an issue when you're reusing data in the cache between loads from the SSD. The contents of the GPU (and CPU) caches are constantly changing to fit in the latest geometry, texture, or compute data. Cache data is incredibly temporary. Conceptually, it's only meaningful if you're changing the data mid-workload. So, say, drawing thousands of trees with a mesh that resides in cache, and then loading in a replacement mesh mid-frame that doesn't get updated within the cache. The moment you request a new model or texture or anything not in the caches, they get replaced with the contents in RAM.

Is that really it? Caches that are constantly prefetching data, constantly being replaced hundreds of times a frame, run the risk of caching stale data?

I guess you're suggesting compute could be more susceptible, but that'd be treating the SSD like RAM and RAM as a scratchpad for that SSD data. Load 1 MB of data into memory address 0x12345678, process it on compute. Load a different 1 MB of data to 0x12345678. Try to process that on compute, but it's using stale data. The moment you get amounts of data larger than the cache size, you'd end up refreshing the cache data ordinarily.

Cerny envisions a future of small datasets used this way?
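For what it's worth, the hazard in that scenario can be sketched in a few lines (hypothetical names throughout, just modelling "RAM updated behind the cache's back"):

```python
# Toy illustration of the stale-read hazard described above.

ram = {}
cache = {}

def ssd_load(addr, data):
    """Simulate the SSD/DMA writing straight to RAM, bypassing the cache."""
    ram[addr] = data

def gpu_read(addr):
    """Simulate a cached read: a hit returns the cached copy, a miss fills from RAM."""
    if addr not in cache:
        cache[addr] = ram[addr]
    return cache[addr]

ADDR = 0x12345678                 # the reused scratch address from the post

ssd_load(ADDR, "chunk_0")
print(gpu_read(ADDR))             # chunk_0 -- first pass, cache fills

ssd_load(ADDR, "chunk_1")         # new data lands in RAM at the same address
print(gpu_read(ADDR))             # still chunk_0 -- stale, the cache never saw the write

cache.pop(ADDR, None)             # invalidate/scrub that address first...
print(gpu_read(ADDR))             # ...and now the read returns chunk_1
```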
 
Cerny's vision is about speed, efficiency, and getting rid of system latencies everywhere. Being able to stream data directly into the main ram while a frame is being rendered by both the CPU and GPU should be very helpful to reach that goal.

My understanding is that at the end of the rendering the data will be ready so the next frame rendering could start immediately without getting the data into the ram first. And also they should be able to save precious memory bandwidth (and time) as they could use some unused bandwidth during the frame rendering. But I could be wrong.
 
Cerny's vision is about speed, efficiency, and getting rid of system latencies everywhere. Being able to stream data directly into the main ram while a frame is being rendered by both the CPU and GPU should be very helpful to reach that goal.
Sure, but the chances of that happening while that same data address is sitting in the GPU caches are minimal, surely?

My understanding is that at the end of the rendering the data will be ready so the next frame rendering could start immediately without getting the data into the ram first.
The cache isn't big enough for that. You've got a few MB of data for the GPU to use. This'll be texture data, vertex data, render-target buffer, and what-have-you: a small slice of the whole dataset, streamed ahead of requirement by the cache controller. Heck, a 32-bit 1080p render target is about 8 MB, more than will fit in the cache, so there's no way you can hold a render target and model geometry in their totality at the same time. The GPU works from data in RAM, with the cache sitting between those requests to optimise fetches by prefetching based on data localities and access patterns, constantly changing the pieces of RAM it holds.
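Quick arithmetic behind that ~8 MB figure, with the cache size as an assumption (typical RDNA-class L2 is a few MB; Sony hasn't published the PS5's exact cache sizes):

```python
# Back-of-the-envelope: a 32-bit-per-pixel 1080p render target vs. GPU L2 size.

width, height, bytes_per_pixel = 1920, 1080, 4
render_target_bytes = width * height * bytes_per_pixel
print(render_target_bytes / 1e6)       # ~8.3 MB for a single 1080p RGBA8 target

# Assume an L2 of ~4 MB (typical for RDNA-class parts; PS5's exact figure
# isn't public). Even one such target doesn't fit, let alone geometry,
# textures and a render target all at once.
L2_BYTES = 4 * 1024 * 1024
print(render_target_bytes > L2_BYTES)  # True
```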
 

Might there be any use of such functionality as pertains to ray tracing?

Reflections and refractions, I assume, will vary more frequently than any geometry or textures. My reasoning is that geometry and texture data is relatively limited by the camera - in line with Cerny's Road to PS5 presentation, in which he discussed the requirement of ~5GB/s to stream in new data as the player does a 180° turn.

However, something such as a single sheet of shiny silver, shellacked to a sheen won't play as pleasantly and predictably. A single, flat sheet bouncing around would call for reflection data from a moving target, but only for two planes. But stick a bit of a crumple in that sheet of metal, and you've got hundreds of faces, all calling for different, small samples of textures, dotted around the room. Thankfully, the SSD and I/O can handle feeding the GPU with all those little bits of data on the fly.

Might the cache scrubbers help prevent hiccups/gain some efficiency?
 
If console real-time ray tracing is anything like PC real-time ray tracing, then it is BVH-bound, and rebuilding the BVH incurs performance degradation, so if anything the RT data is less dynamic than anything else.
 