Next Generation Hardware Speculation with a Technical Spin [post GDC 2020] [XBSX, PS5]

Discussion in 'Console Technology' started by Proelite, Mar 16, 2020.

  1. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,367
    Likes Received:
    3,959
    Location:
    Well within 3d
    For there to be additional bandwidth overhead due to GPU and CPU memory traffic, the CPU's access patterns don't need to be random, just different from the GPU's. If the CPU isn't hitting the same arrays or happens to be writing something when the GPU is content with mostly reads, there would be some additional cycles lost.
    The idea that CPU bandwidth consumption should be minimized if the GPU is bandwidth constrained was the point of the PS4 slide back when it was first released, although it doesn't seem like the zero bandwidth case is all that practical.
    As far as "prefetch, work in cache, write" goes, I don't know how broadly I should interpret your wording. While it is preferred for a working set to fit in cache, CPUs don't have full control over whether the cache hierarchy writes to memory, since it's not a local store. There are hardware prefetchers and software prefetch, but there are practical limits to how far ahead they can go for most workloads before bandwidth consumption on unnecessary reads becomes counterproductive, or before the cache starts evicting parts of it. Zen 2 has decently sized caches (not clear what the capacity is for the consoles), but high performance cores will quickly exhaust what they can hold in many cases.

    This is outside of cases where the CPUs or DMA controllers are expected to move data into system memory, which would have overhead.

    There is peer to peer DMA functionality, and there were somewhat recent Linux changes mentioning it for Zen. Perhaps if the drive works with that it could avoid a trip to main memory.
     
  2. Metal_Spirit

    Regular Newcomer

    Joined:
    Jan 3, 2007
    Messages:
    558
    Likes Received:
    341
    If you work with a 1-second offscreen margin on each side, that is enough to feed 9 GB of new data into memory, in other words, to replace 56.25% of your total memory content.
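    As a quick back-of-envelope check (taking the 16 GB of unified memory as a given and the 9 GB delivered per second as the assumption):
    Code:
        # Fraction of memory replaced per second under the assumptions above.
        total_memory_gb = 16.0
        new_data_gb = 9.0
        print(f"{100 * new_data_gb / total_memory_gb:.2f}% of memory replaced")  # 56.25%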
     
    KeanuReeves likes this.
  3. Betanumerical

    Veteran

    Joined:
    Aug 20, 2007
    Messages:
    1,695
    Likes Received:
    171
    Location:
    In the land of the drop bears
    When I said memory spaces, I meant that you would need to specifically request that memory be striped in a way that gives the speed and access you want. Because Microsoft has only mentioned that there is a fast space and a slow space, it makes me think the parallel access iroboto mentioned is not something they did.
     
    BRiT likes this.
  4. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    10,837
    Likes Received:
    10,890
    Location:
    The North
    Or it might be, by the same logic, if somehow 10GB was striped for bandwidth and 6GB for generic access.

    I’m actually quite curious. I’ve learned a lot the last couple of days.
     
    blakjedi and BRiT like this.
  5. psorcerer

    Regular

    Joined:
    Aug 9, 2004
    Messages:
    732
    Likes Received:
    134
    They may, but will they target XBSX as a baseline?

    Pretty close to RAM random-read speed on a PC.
    Where you cannot control where things are allocated or placed in RAM anyway.
    So you can think of it like a PC with 16GB of VRAM and 100+GB of ~DDR3-speed RAM.
    Still doesn't ring a bell? :)
     
    egoless, chris1515 and megre like this.
  6. pTmdfx

    Regular Newcomer

    Joined:
    May 27, 2014
    Messages:
    294
    Likes Received:
    187
    The base clock is guaranteed for processors out in the market in an optimal operating environment (with some degree of tolerance); otherwise it would not be called base.

    [edit: These parameters are published years ahead of platform launch, so that partners can design against them and QA/binning has a reference model. That's also why we don't see ridiculous news of a big-brand, 125W-capable AM4 cooler causing your Ryzen to melt: our world of semiconductor manufacturing is built upon deterministic laws of chemistry and physics.]

    If we put aside the assumption of optimal operating environment, well... almost all modern processors throttle down to the minimum frequency (800 MHz for many AMD CPU designs), until it gets worse to the point where max junction temp is breached and auto shutdown kicks in.
     
    #1686 pTmdfx, Apr 1, 2020
    Last edited: Apr 1, 2020
  7. psorcerer

    Regular

    Joined:
    Aug 9, 2004
    Messages:
    732
    Likes Received:
    134
    But that's not how the CPU should work in a game, though...
    In the end, what is rendered is the final result, therefore the CPU needs to work on the same buffers as the GPU.
    Maybe I'm just not interested in theoretical scenarios, only practical ones.

    Which was just a warning for the developers. We don't know if the PS4 ever landed on the right-hand side of that graph in any game. Do we? :)

    Yep, but you can predict things. And profile them.

    That still brings us back to the main question: what are the practical, typical loads?

    Yup. So the "base clock" is still a prediction, albeit a more conservative one. You could make other, even more conservative predictions about a hypothetical "100% load" scenario.
    Or we can safely assume that for some particularly bad loads we can go as low as possible, and the solution is simply not to use that load configuration.
     
  8. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    10,837
    Likes Received:
    10,890
    Location:
    The North
    What?
    Why? It's doing way more work per clock cycle. That's the only reason it's slowing down so much.
     
  9. dobwal

    Legend Veteran

    Joined:
    Oct 26, 2005
    Messages:
    5,435
    Likes Received:
    1,497
    AMD's base clocks are the lowest frequency their GPUs will run at in the presence of a power virus.
     
  10. pTmdfx

    Regular Newcomer

    Joined:
    May 27, 2014
    Messages:
    294
    Likes Received:
    187
    Well... if you play this card, everything in the industry is merely a “conservative prediction”. Conditionality and error tolerance are not equivalent to unpredictability and indeterminism.

    Specifications (and the design & validation processes around them) exist to define constraints that, if satisfied, enable the chip to attain repeatable optimal performance as designed over the expected lifespan.
     
    #1690 pTmdfx, Apr 2, 2020
    Last edited: Apr 2, 2020
    PSman1700, iroboto and VitaminB6 like this.
  11. psorcerer

    Regular

    Joined:
    Aug 9, 2004
    Messages:
    732
    Likes Received:
    134
    The map screen in HZD is also doing a lot of work per cycle. Is that really needed?
    Or even: should we optimize our TDP and cooling for that particular case? Why not?

    Agree.

    The point that the "naysayers" articulate is that somehow MSFT's claims about a "fixed clock" are much more honest/valuable than Sony's claims about "fixed power".
    I see them both as pretty optimistic targets, with no real difference.
     
    egoless and Mitchings like this.
  12. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,367
    Likes Received:
    3,959
    Location:
    Well within 3d
    Reading data into RAM should be more linear than other, more scattered workloads that might be considered random.
    GDDR6 DRAM pages are 2KB, and various forms of NAND have similar minimum page sizes. If striping across 16 or 20 GDDR6 channels, each channel would need to fill 2KB, so 32KB or 40KB in that scenario. That would be 8 or 10 4KB memory pages, and it might get convoluted to try fiddling with bytes to mess with alignment. It would take a stream of 64 linear writes to populate.
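    As a minimal sketch of that striping arithmetic (the 2KB open-page size is the figure above; the channel counts are the two configurations being discussed):
    Code:
        # Stripe width needed to touch every channel's open 2 KB DRAM page once.
        DRAM_PAGE = 2 * 1024  # bytes per open page, per GDDR6 channel
        for channels in (16, 20):
            stripe = channels * DRAM_PAGE
            print(f"{channels} channels -> {stripe // 1024} KB stripe, "
                  f"i.e. {stripe // 4096} 4KB OS pages")
        # 16 channels -> 32 KB stripe, i.e. 8 4KB OS pages
        # 20 channels -> 40 KB stripe, i.e. 10 4KB OS pages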

    For CPU write updates there would be barriers preventing them from working on the same addresses at the same time. The GPU's memory pipeline isn't coupled tightly enough to start reading through the buffer until the CPU signaled it was done. GPU write updates would be round trips to memory, either by not caching or flushing cache lines since the GPU caches cannot be snooped. Although in that case it's better than trying uncached reads by the CPU, which at least for older APUs were massive performance hits.
    At that point, one or the other should have moved on to other places in memory, and the horizon for aligned access within DRAM is on the order of 2KB pages, or 8-16KB if accepting some penalties within bank groups.



    I haven't seen a breakdown for PS4 games on that metric. I'm not sure if disclosure would be allowed. The point would be to encourage them to reduce bandwidth consumption, but given how games on PC continue to show the influence of DRAM speed and bandwidth even though the GPU has separate memory, I don't expect that they could make the CPU portion negligible.

    There's a pretty spotty record on that. AMD discourages software prefetch in most instances because there's a limit to how far those predictions hold, and the lower-overhead hardware prefetchers tend to win.
    However, there has been profiling on the amount of bandwidth increase due to prefetch traffic, and decreasing accuracy the further the prefetch runs ahead.

    Going by how PC games have shown measurable benefits with Zen based on memory speed, my conservative guess would be to initially try a safe footprint of 30-40 GB/s if working with a structure similar to games that cross platforms. Granted, that is a mixture of latency and bandwidth due to how closely the Infinity Fabric is linked to memory speed.
    On the other hand, the PS4 had <20GB/s for its Jaguar cores, and a coherent Onion bus with 10 GB/s read/write. A ~4x improvement in CPU capability allows extrapolating to 3-4x that for the new platform.
    I think the Jaguar module may have had a 16 byte link to the northbridge, likely running half core speed or at the GPU speed since those tended to line up. Two modules might have doubled that.
    Zen has twice the width and the fabric is running at 2x or more the speed, so that fits too.
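    A crude version of that extrapolation, where the PS4-era figure is the one quoted above and the 3-4x scaling is the hand-wavy assumption rather than a measured number:
    Code:
        ps4_coherent_bw_gbs = 10.0      # Onion bus read/write figure
        scale_low, scale_high = 3, 4    # assumed Jaguar -> Zen 2 platform gain
        print(f"~{ps4_coherent_bw_gbs * scale_low:.0f}-"
              f"{ps4_coherent_bw_gbs * scale_high:.0f} GB/s first-pass CPU footprint")
        # ~30-40 GB/s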
     
  13. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    10,837
    Likes Received:
    10,890
    Location:
    The North
    I mean, that's not the same thing I'm referring to. That's just an unlocked frame rate, where higher frequencies mean the chip can simply deliver more.

    If you're doing AVX2 workloads or lots of parallel processing together that is causing a downclock, that's something else entirely.
     
  14. psorcerer

    Regular

    Joined:
    Aug 9, 2004
    Messages:
    732
    Likes Received:
    134
    Dunno. The PC has a different set of trade-offs. And I don't think that streaming RAM->VRAM on PC has no impact on the GPU-accessible bandwidth.
    Not to mention that the CPU cannot write that much into RAM per frame anyway; it's too slow.
    So I would suspect that on PC, most of the time, the GPU and CPU work on similar data sets in a "read heavy" manner.

    0.5-1GB per frame? What for?
    Pathfinding for 1000 agents on real geometry? :)
     
  15. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,367
    Likes Received:
    3,959
    Location:
    Well within 3d
    There's minimal performance gain with PCIe 3.0 vs 4.0, so the footprint for streaming should be below 15 GB/s in most situations. Board RAM is on the order of 4-8 GB for many cards, and games are written to be paranoid about streaming, just as they are on the consoles. The PC memory is not a guaranteed amount, PCIe transfer utilization isn't consistent, and we've seen performance suffer if swapping starts to happen.
    Utilization of PCIe transactions favors larger payloads, which translates into more linear accesses in main memory if DMA hasn't bypassed it.
    If we are concerned about the impact of CPU data movement to the graphics domain, PCIe 3.0 or 4.0 x16 would appear to give 15-30 GB/s to be concerned about.
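    For reference, that bracket is essentially raw x16 link bandwidth in one direction, before packet and protocol overhead trims it a little further (a minimal sketch of the arithmetic):
    Code:
        # Raw PCIe x16 bandwidth, one direction, 128b/130b encoding (3.0 and 4.0).
        def pcie_x16_gbs(gt_per_lane):
            return gt_per_lane * (128 / 130) * 16 / 8   # GT/s per lane -> GB/s across 16 lanes

        print(f"PCIe 3.0 x16: ~{pcie_x16_gbs(8):.1f} GB/s")    # ~15.8
        print(f"PCIe 4.0 x16: ~{pcie_x16_gbs(16):.1f} GB/s")   # ~31.5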

    The CCXs with the fabric they have on the desktop could generate 115 GB/s of read traffic and the same amount for writes at the ~1.8 GHz ceiling of the fabric. Although the fabric may restrict things further, such as in the case of only allowing 16 bytes/cycle for writes to the already modest memory controllers.
    We don't know yet if the consoles keep to those limits or what may change with 16-20 channels of GDDR6. The peak values of the CCX are no longer hidden behind the limits of the memory interface.
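    A minimal sketch of where the 115 GB/s figure lands, using the desktop Zen 2 values mentioned above (32 B/cycle read port per CCX, the 16 B/cycle write caveat, ~1.8 GHz fabric clock):
    Code:
        FCLK_GHZ = 1.8          # approximate fabric clock ceiling
        ccx_count = 2
        print(f"reads:  {ccx_count * 32 * FCLK_GHZ:.1f} GB/s")   # 115.2
        print(f"writes: {ccx_count * 16 * FCLK_GHZ:.1f} GB/s")   #  57.6 if write ports are halved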


    For the purposes of DRAM channel utilization, my concern isn't whether the workloads are similar, it's whether the data being accessed is in the exact same few dozen kilobytes during a given memory controller's time window.

    Whatever they want; I'm just giving a conservative amount with hefty safety margins for the lifetime of a platform that hasn't launched yet. Perhaps whatever high-utilization vector code Cerny might have been concerned about. Even when such workloads utilize the local caches well, some can still demand above-average amounts of bandwidth.
    I'm also making allowances for system operations or functions that can produce bursts of high bandwidth for fractions of a second that may be on a critical path for dependent processes.
     
  16. Rockster

    Regular

    Joined:
    Nov 5, 2003
    Messages:
    973
    Likes Received:
    129
    Location:
    On my rock
    3d, this quote from Andrew Goossen makes me think the Infinity Fabric connection from each CCD is still in that 100GB/s or so range, as it has been...
    "GPU optimal and standard offer identical performance for CPU audio and file IO. The only hardware component that sees a difference is the GPU."
     
    RagnarokFF and blakjedi like this.
  17. Rockster

    Regular

    Joined:
    Nov 5, 2003
    Messages:
    973
    Likes Received:
    129
    Location:
    On my rock
    I also find it funny that Sony, while admitting that the PS5 is unable to run at max clocks with 100% ALU utilization (which is what it would take to hit the max TFLOP figure: 2.223GHz x 36 CUs x 128 FP ops/clk), takes the liberty of rounding that ~10.25 result up to 10.3. MS, on the other hand, simply discards 0.15 TFLOPS of actual compute and rounds down to 12 in their marketing.
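    For reference, the arithmetic behind both headline numbers (64 FP32 ALUs per CU, 2 ops per clock via FMA; clocks and CU counts are the publicly stated ones):
    Code:
        def tflops(clock_ghz, cus):
            return clock_ghz * cus * 64 * 2 / 1000   # GFLOPS -> TFLOPS

        print(f"PS5: {tflops(2.223, 36):.2f} TFLOPS")   # ~10.24 (Sony's stated 2.23 GHz gives ~10.28), marketed as 10.3
        print(f"XSX: {tflops(1.825, 52):.2f} TFLOPS")   # ~12.15, marketed as 12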
     
    davew, blakjedi, milk and 1 other person like this.
  18. Proelite

    Veteran Regular Subscriber

    Joined:
    Jul 3, 2006
    Messages:
    1,459
    Likes Received:
    818
    Location:
    Redmond
    It's because, when the yields are known and they bump the clocks up to 12.85TF, they can just put 13TF instead.

    Still April 1st.
     
    #1698 Proelite, Apr 2, 2020
    Last edited: Apr 2, 2020
    blakjedi, disco_, Silenti and 2 others like this.
  19. Globalisateur

    Globalisateur Globby
    Veteran Regular Subscriber

    Joined:
    Nov 6, 2013
    Messages:
    3,498
    Likes Received:
    2,191
    Location:
    France
    That's not best case. That's the typical speed. Best case is 22GB/s (and we know speeds of 20GB/s are already reached, depending on the data being compressed, according to an actual dev).

    MS's 4.8GB/s is their best case, like the 6GB/s figure they announced (for textures I think, but it still hasn't been benchmarked on their machine, and how it will impact the CPU is rather unclear). MS still hasn't divulged the typical cases, and I doubt they will anyway.
     
    egoless and KeanuReeves like this.
  20. Metal_Spirit

    Regular Newcomer

    Joined:
    Jan 3, 2007
    Messages:
    558
    Likes Received:
    341
    Actually Nvidia states 1 to 2 GB. Consoles have 16. I would say it's a fair amount. RT on Xbox has to be done in the first 10 GB, so it's up to 20% memory usage just for the RT structure. Wouldn't it be nice if you could dump this, even if only in part, to the SSD?
    I would say so!

    As for the SSD... not really. A faster SSD alone will not free you from all constraints. A 10 times faster SSD on the PS4 would only bring you 2x gains, and although the proportions may change, this reality is common to all systems. To really take advantage of an SSD you need to get rid of a lot of other restrictions. That is what the PS5 did. Xbox has changes too, but as far as is public, not to the same extent.
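    One way to read that 10x-to-2x claim is as plain Amdahl's law, where only the raw I/O portion of a loading pass gets faster; the 55% I/O fraction below is an illustrative assumption, not a measured PS4 number:
    Code:
        def speedup(io_fraction, io_gain):
            return 1.0 / ((1.0 - io_fraction) + io_fraction / io_gain)

        print(f"{speedup(0.55, 10):.2f}x overall from a 10x faster drive")   # ~1.98x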

    We were talking about generic SSD gains. But we all know very well that tweet was a joke about the possible gains an SSD could bring to the PS5 over the X.
    In fact, this is the console forum!
    And in that case, you cannot dissociate the SSD from the fact that it works in conjunction with those changes. That's not something available in the PC space for a comparison to be made.

    Also, I would not say all SSDs would be enough to break any sort of I/O limit for some time. At least not in comparable ways. For instance, both consoles use dedicated compression and several optimizations on I/O.
    Yet Microsoft games will support the Xbox One. Can they really use these changes for anything meaningful in game concept and design?
    And when the current-gen consoles are left behind? PCs cannot reach those levels of data compression without sacrificing CPU performance. They do not have dedicated decompressors and other custom changes.
    And PCs are now part of the Xbox platform.
    Heck, most PCs have no SSD, and most of the ones that do have 120 to 256 GB, half used by Windows 10 and other installs.
    I see complaints in the Call of Duty community about their latest games: with all the patches the game is now over 100 GB in size, and people simply do not have enough SSD space.
    Besides, most of them cannot even reach a 1 GB/s transfer speed.
    So how will that work? Can we really compare those SSDs and the gains they can bring to performance and game design?
     
    egoless likes this.