Current Generation Hardware Speculation with a Technical Spin [post GDC 2020] [XBSX, PS5]

AMD's desktop cache size is dictated by Epyc, not because latencies would be suboptimal on desktop. The savings you'd get from cutting the cache in half, or even to a quarter, aren't worth the cost of developing a new chiplet for it.
Yeah - the extra cache at every level that Ryzen 3XXX has is just such a great thing for game performance.
Game consoles shouldn't need large caches like desktops, because they're not really multi-tasking like a PC and their data accesses should be predictable. As long as devs are thinking about cache alignment of data, and making good use of cache-line reads with linear data, a smaller cache shouldn't be a big issue.
I don't want to quote Bill Gates here, but I think cache size will be a differentiating factor in game performance between processors going forward. Not every aspect of code is going to align perfectly with the cache design of a single processor, even if that processor is the baseline. Ambition runs up against the necessity of releasing a game on time, supporting multiple SKUs, and so on. If we look at this gen as an example, a hell of a lot of games most definitely did not keep everything perfectly aligned in cache on the Jag.
 
Having a lot of cache makes things easier, and not all dev studios have the same technical ability. But I think you're seeing a shift in the game industry that's taking a long time but is being embraced more and more: a move towards data-oriented design and away from typical object-oriented design. Obviously some studios are further along than others, and they're practising it to varying degrees. Doom Eternal primarily focused on keeping all cores busy rather than on being cache friendly, but it's most likely fairly cache friendly anyway, as the team did have a focus on data-oriented design principles.

The game industry started shifting to C++ and OOP in the early 2000s, but in the late 2000s you had blog posts like the following that generated a lot of interest in game dev:
http://gamesfromwithin.com/data-oriented-design

Since then it's been a slow transition away from OOP again. I would expect you'll see more and more dev studios become more conscious of how they use CPUs this gen, especially as Unity makes a big push in that direction.
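
[Editor's note: a minimal, hypothetical sketch of what that data-oriented shift means at the cache level. The Particle type and its fields are invented for illustration; the point is the memory layout, not the maths.]

```cpp
#include <cstddef>
#include <vector>

// "Array of structs": the typical object-oriented layout. Updating positions
// drags every field of every Particle through the cache, wanted or not.
struct ParticleAoS {
    float x, y, z;
    float vx, vy, vz;
    float r, g, b, a;   // untouched by the position update, but cached anyway
    float lifetime;
};

void update_aos(std::vector<ParticleAoS>& ps, float dt) {
    for (auto& p : ps) {
        p.x += p.vx * dt;
        p.y += p.vy * dt;
        p.z += p.vz * dt;
    }
}

// "Struct of arrays": the data-oriented layout. Every cache line the update
// loop pulls in is packed with data the loop actually uses.
struct ParticlesSoA {
    std::vector<float> x, y, z;
    std::vector<float> vx, vy, vz;
    std::vector<float> r, g, b, a;
    std::vector<float> lifetime;
};

void update_soa(ParticlesSoA& ps, float dt) {
    for (std::size_t i = 0; i < ps.x.size(); ++i) {
        ps.x[i] += ps.vx[i] * dt;
        ps.y[i] += ps.vy[i] * dt;
        ps.z[i] += ps.vz[i] * dt;
    }
}

int main() {
    std::vector<ParticleAoS> aos(100000);
    ParticlesSoA soa;
    soa.x.assign(100000, 0.f);  soa.vx.assign(100000, 1.f);
    soa.y.assign(100000, 0.f);  soa.vy.assign(100000, 1.f);
    soa.z.assign(100000, 0.f);  soa.vz.assign(100000, 1.f);
    update_aos(aos, 1.0f / 60.0f);
    update_soa(soa, 1.0f / 60.0f);
}
```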

Edit: If you want to see Mike Acton being abrasive with OOP advocates, watch the following. It'll give you a good understanding of why Insomniac has always been a top dog in game optimization, a reputation that's continued since he moved on to Unity.

 
BW is a rate, not a quantity, defined as the amount of data divided by the time taken. You're just looking at the amount of data and not the time needed to read it. For the period of time the CPU is accessing the RAM, it can read at 448 GB/s, meaning a fraction of a second to read some GBs of data. When the audio is accessing the RAM, it can read at 448 GB/s, meaning a fraction of a second to read some GBs of data. And when the GPU is accessing the RAM, it can read at 448 GB/s, meaning a fraction of a second to read some GBs of data.
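
[Editor's note: to put rough numbers on "a fraction of a second", a minimal sketch. The 448 GB/s peak is the figure from the post above; the data sizes are arbitrary.]

```cpp
#include <cstdio>

// Bandwidth is a rate: bytes divided by time. At a peak of 448 GB/s,
// whichever client currently owns the bus, reading a chunk of data
// takes size / rate seconds.
int main() {
    const double peak_gb_per_s = 448.0;
    const double sizes_gb[] = {0.5, 2.0, 8.0};

    for (double gb : sizes_gb) {
        double ms = gb / peak_gb_per_s * 1000.0;
        std::printf("%.1f GB at %.0f GB/s -> %.2f ms\n", gb, peak_gb_per_s, ms);
    }
}
```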
Where does memory contention fit into this?
 
Where does memory contention fit into this?
Yea :)

I mean, I get what Shifty is saying, but yeah... unfortunately, he wrote it as a rate. What he means is that it gets access to the full bus width to pull from all chips, so it's achieving the full bandwidth in that respect.
But in a single second, if the CPU goes first, even for a blip, and takes up 3,000 clock cycles, those are now 3,000 fewer for the GPU within that second, so the rate has to drop; or rather, that 448 GB/s is divided among the CPU, audio, and GPU. And because there is contention, for whatever reason I don't understand, there has to be a clearing of things before the next agent can begin pulling what it needs, and that results in additional cycles lost. That's not counting all the bandwidth lost from doing weird read/write-type requests.
 
Totally made up data, but this is how I'd explain contention. This is obviously a completely idealized example just to illustrate the concept.

You can read x bits per clock, for example.
[Image: GPU_constant.jpg]
The area under the graph is the total data transferred over that time.

[Image: GPU_constant_rate.jpg]
So then, if you have contention, where the GPU and CPU are taking turns accessing memory, you have lowered your effective bandwidth like so:

[Image: GPU_CPU_contention.jpg]
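
[Editor's note: the same idea as a minimal sketch in code, with numbers just as made-up as the charts above. Each client reads at the full peak rate while it owns the bus; contention shrinks each client's share of the timeline (the area under its rate-vs-time curve), not the height of the peak.]

```cpp
#include <cstdio>

// Toy model of the charts: data moved = rate x time owned on the bus.
int main() {
    const double peak = 448.0;        // GB/s while on the bus
    const double window = 1.0 / 60.0; // one 60 Hz frame, in seconds

    double gpu_alone = peak * window;        // GPU owns the whole window
    double gpu_share = peak * window * 0.70; // GPU owns 70% of the window
    double cpu_share = peak * window * 0.30; // CPU owns the other 30%

    std::printf("GPU alone : %.2f GB per frame\n", gpu_alone);
    std::printf("Contended : GPU %.2f GB + CPU %.2f GB per frame\n",
                gpu_share, cpu_share);
}
```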
 
Where does memory contention fit into this?
It doesn't; these rates are given as peak numbers.
I mean, I get what Shifty is saying, but yeah... unfortunately, he wrote it as a rate. What he means is that it gets access to the full bus width to pull from all chips, so it's achieving the full bandwidth in that respect. But in a single second, if the CPU goes first, even for a blip, and takes up 3,000 clock cycles, those are now 3,000 fewer for the GPU within that second, so the rate has to drop.
:runaway: No, it's a rate. 'One second' doesn't come into it. That's a completely arbitrary time slice that makes for a nice unit of measurement; we could measure it in kilonibbles per twelve microseconds* instead and then you wouldn't be thinking about what happens over a whole second. You can measure BW over one second or one millisecond or one microsecond. The specification for the bus in GB/s is defined as a rate, gigabytes divided by seconds.

If someone is tasked with filling two buckets with water twenty metres apart, and has a hose that can deliver one litre per second, and they spend their time running back and forth between the buckets, filling each for half a second before running to the next one, how would you describe the rate of water from the hose? It's spending most of its time splashing water everywhere but the buckets. What's the peak rate either of the buckets can be filled at? One litre a second! That's the flow the hose can provide and the specification of the hose. You may have an incredibly inefficient system where the attained performance is far below that peak, but the specification is one litre per second. If after a minute both buckets have 5 litres in them, the efficiency of the system can be described as ten litres out of a potential 60 litres, so only about 17% efficient - but that's not because of a limitation of the hose.

* Probably what the US would be using if Imperial Measurements had found their way to modern science. :p
 
Great to read tech stuff I can totally understand! :D
 
It doesn't; these rates are given as peak numbers.
Okay. I get that. So why is contention a destroyer of bandwidth rate then?
 
Okay. I get that. So why is contention a destroyer of bandwidth rate then?
That's an efficiency thing. In the case of my bucket-filling example, it'd be the overhead in swapping from one bucket to another. Let's say there's a tap that delivers one litre of water a minute, and two people, Mr. CPU and Mrs. GPU, both trying to fill their buckets. As they push and shove each other out of the way to get their buckets filled, there are moments when neither bucket is being filled.

Bandwidth is no different a measurement than TFLOPS. A GPU may be described as 10 teraflops, but in real terms it probably only computes maybe 6 trillion sums a second - worse with inefficient workloads, better with others - yet we still describe the part by its peak capability rather than whatever it happens to achieve in some use cases.
 
Okay. I get that. So why is contention a destroyer of bandwidth rate then?

I can't say for sure, but I have a feeling that CPU prioritisation and the different needs of the CPU (latency) and GPU (throughput) might be a big factor on consoles.

Memory latency for CPUs is shown in reviews as being under 100 ns, but as far as I can work out from the internets, GPU memory latency can be in the hundreds of ns - I've seen figures as high as 300 ns or more. I take this to mean that the GPU is scheduling accesses several deep, with a view towards optimising for maximum throughput.

If a CPU access were to rudely jump in at the front of the queue on a particular channel, I could imagine significant disruption to your carefully ordered GPU accesses leading to dead time on parts of the bus.
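
[Editor's note: a hypothetical back-of-envelope model of that "dead time". Assume each CPU/GPU ownership switch on a channel costs a fixed number of lost cycles while the carefully ordered GPU access pattern is re-established; the penalty and switch counts below are invented purely for illustration. The peak rate never changes, only the attained fraction of it.]

```cpp
#include <cstdio>
#include <initializer_list>

// Model: attained = 1 - (dead cycles per second / total cycles per second).
int main() {
    const double cycles_per_second = 14e9;      // 14 Gbps GDDR6 data rate, per pin
    const double dead_cycles_per_switch = 200;  // made-up turnaround penalty

    for (double switches_per_ms : {0.0, 1000.0, 10000.0, 50000.0}) {
        double dead = switches_per_ms * 1000.0 * dead_cycles_per_switch;
        double attained = 1.0 - dead / cycles_per_second;
        std::printf("%7.0f switches/ms -> %5.1f%% of peak attained\n",
                    switches_per_ms, attained * 100.0);
    }
}
```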
 
Wow Shifty, that picturesque hose-and-bucket metaphor was brilliant. If I were an engineer, that mental image of a poor man clumsily running back and forth between two buckets, spilling water all over the pavement, would forever stay with me and haunt me whenever my systems were not running as optimally as they could.
 
I agree. An excellent metaphor, Shifty, when even technically illiterate people can now understand bandwidth contention between the CPU and GPU.

But it's weird... if it's clearly an issue, isn't there a way for the consoles to be designed around it to mitigate that? I heard a lot about it on the base PS4... and the Pro was supposedly way more BW-starved than that. What would Sony, for example, need to do to maximise bandwidth usage without simply adding more bandwidth to the cost of the machine? (Still holding out hope for 16 Gbps chips.)
 
Okay. I get that. So why is contention a destroyer of bandwidth rate then?

For prior AMD APUs, and it's still true for the XSX, only the GPU is able to fully exploit the bandwidth of RAM.

If the GPU of an APU can pull 100 GBps of bandwidth from RAM but the CPU can only pull 50 GBps, then with 50% of accesses devoted to each processor you only end up with a total aggregate bandwidth of 75 GBps.

Devote 75% of accesses to the GPU and the aggregate bandwidth is 87.5 GBps, with the GPU seeing 75 GBps while the CPU sees 12.5 GBps.

If the CPU (or coherent bus) could fully exploit the bandwidth offered by GDDR6, then the rate across the APU would be an aggregate bandwidth of 100 GBps regardless of how the accesses are allocated. That's ignoring all the normal aspects of the memory system that affect bandwidth.
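
[Editor's note: the arithmetic above as a small sketch. The 100/50 GBps rates are the hypothetical figures from the post, not measured numbers; the aggregate is a weighted average of the two rates, set by how the access slots are split.]

```cpp
#include <cstdio>
#include <initializer_list>

// Aggregate bandwidth when two clients with different sustainable rates
// share a bus: each reads at its own maximum while it owns its slots.
int main() {
    const double gpu_rate = 100.0; // GBps the GPU can sustain (hypothetical)
    const double cpu_rate = 50.0;  // GBps the CPU path can sustain (hypothetical)

    for (double gpu_frac : {0.50, 0.75, 1.00}) {
        double gpu_seen = gpu_frac * gpu_rate;
        double cpu_seen = (1.0 - gpu_frac) * cpu_rate;
        std::printf("GPU %3.0f%% of slots: GPU %5.1f + CPU %4.1f = %5.1f GBps aggregate\n",
                    gpu_frac * 100.0, gpu_seen, cpu_seen, gpu_seen + cpu_seen);
    }
}
```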
 
Edit: If you want to see Mike Acton being abrasive with OOP advocates, watch the following. It'll give you a good understanding of why Insomniac has always been a top dog in game optimization, a reputation that's continued since he moved on to Unity.


Is that the same Mike Acton who decided 60 fps was pointless based on sales data excluding COD and sports games?
 
For prior AMD APUs, and it's still true for the XSX, only the GPU is able to fully exploit the bandwidth of RAM.
Right, this is what I've known. In that case, I should have used the word 'aggregate' to describe this.
 
https://uploadvr.com/sony-next-gen-vr-controllers-finger-tracking/

https://dl.acm.org/doi/abs/10.1145/3313831.3376712


The video accompanies a new research paper titled ‘Evaluation of Machine Learning Techniques for Hand Pose Estimation on Handheld Device with Proximity Sensor’. Crucially, the work is authored by Kazuyuki Arimatsu and Hideki Mori from Sony Interactive Entertainment. That’s the division of Sony specifically responsible for the PlayStation brand, and not the wider corp.
 