ROP/Bandwidth Consumption & Low-Level GCN Optimizations ala Humus GDC 2014

Starx · Mar 30, 2014

On the XB1, if we are rendering to ESRAM, 64bit just about hits the crossover point between ROP and bandwidth-bound. But even if we render to the relatively slow DDR3 memory, we will still be ROP-bound if the render-target is a typical 32bit texture.

http://www.humus.name/Articles/Persson_LowlevelShaderOptimization.pdf

McHuj · Mar 30, 2014

Thanks, that's an awesome presentation. For someone who writes DSP assembly all day, it's really interesting to see what shader code looks like underneath.

function · Mar 30, 2014

Yes, that's a very interesting presentation (and includes good annotation too, not just the ppt slides).

The above tallies very well with sebbbi's comments about packing render targets into rgba16 formats to make best use of the rops.

Could anyone clear something up for me? The presentation talks about no interpolation hardware in DX11 / GCN. Would this include texture interpolation as seen in bi/tri/aniso filtering?

If so we might have an explanation as to why the potentially ROP limited XB1 has more resources available for texture filtering in some games than the otherwise higher performing PS4 ...?

Thowllly · Mar 30, 2014

function said:
Could anyone clear something up for me? The presentation talks about no interpolation hardware in DX11 / GCN. Would this include texture interpolation as seen in bi/tri/aniso filtering?

No, interpolation between texels is still done by the texture units.

function said:
If so we might have an explanation as to why the potentially ROP limited XB1 has more resources available for texture filtering in some games than the otherwise higher performing PS4 ...?

I don't see what you mean here. The PS4 is ~41% faster than the XB1 if the filtering is done in the texture units, and ~41% faster than the XB1 if the filtering is done with the ALUs, so it makes no difference.
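As a rough sanity check of the ~41% figure, here is a minimal sketch. The CU counts and clocks are assumptions taken from the commonly quoted public specs (they are not stated in this thread), as are the per-CU unit counts for GCN. The point it illustrates is that ALUs and texture units both scale with CU count and clock, so the ratio comes out the same either way.

```python
# Assumed specs (not from this thread): PS4 has 18 CUs at 800 MHz,
# XB1 has 12 CUs at 853 MHz. A GCN CU is assumed to have 64 ALU lanes
# and 4 texture units.
SPECS = {
    "PS4": {"cus": 18, "clock_mhz": 800},
    "XB1": {"cus": 12, "clock_mhz": 853},
}
ALU_LANES_PER_CU = 64
TMUS_PER_CU = 4


def throughput(name, units_per_cu):
    """Relative throughput: units * clock (arbitrary but consistent scale)."""
    spec = SPECS[name]
    return spec["cus"] * units_per_cu * spec["clock_mhz"]


def ps4_advantage_percent(units_per_cu):
    """How much faster the PS4 is than the XB1 for a given per-CU unit type."""
    ratio = throughput("PS4", units_per_cu) / throughput("XB1", units_per_cu)
    return (ratio - 1.0) * 100.0


alu_adv = ps4_advantage_percent(ALU_LANES_PER_CU)
tmu_adv = ps4_advantage_percent(TMUS_PER_CU)
print(f"ALU advantage: {alu_adv:.1f}%")
print(f"TMU advantage: {tmu_adv:.1f}%")
```

The per-CU unit count cancels out of the ratio, which is why moving filtering work between the texture units and the ALUs would not change the relative standing of the two consoles.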

Shortbread · Mar 30, 2014

Starx said:
http://www.humus.name/Articles/Persson_LowlevelShaderOptimization.pdf

Humus is a good guy...

Always enjoyed reading his replies regarding ATI's GPUs (performance, etc...) over at Rage3D.

Davros · Mar 30, 2014

Why is there no RGBA32 for the PS4?

Jcustom · Mar 30, 2014

Davros said:
Why is there no RGBA32 for the PS4?

Because it's already BW bound with rgba16f, no need to give other examples.

Shortbread · Mar 30, 2014

function said:
If so we might have an explanation as to why the potentially ROP limited XB1 has more resources available for texture filtering in some games than the otherwise higher performing PS4 ...?

How did you come to that conclusion? Because the PS4 "RGBA32 filtering" is missing from the slide? Come on now... PS4 has way more math (throughput) on handling that as well.

Shortbread · Mar 30, 2014

Jcustom said:
Because it's already BW bound with rgba16f, no need to give other examples.

Nowhere in the PS4 slide does it say PS4 RGBA16F is ROP bound. However, it does state XB1 is RP/BW bound at that particular level.

AlNom · Mar 30, 2014

Shortbread said:
Are you kidding or just trolling? Nowhere in the PS4 slide does it say PS4 RGBA16F is ROP bound. However, it does state XB1 is RP/BW bound at that particular level.

It clearly states RGBA16F is bandwidth bound. Any higher format will still be.... bandwidth bound.

KKRT · Mar 30, 2014

Shortbread said:
Are you kidding or just trolling? Nowhere in the PS4 slide does it say PS4 RGBA16F is ROP bound. However, it does state XB1 is RP/BW bound at that particular level.

Calm down and check slides again. RGBA16F is BW bound on PS4 there.

Jcustom · Mar 30, 2014

Shortbread said:
Are you kidding or just trolling? Nowhere in the PS4 slide does it say PS4 RGBA16F is ROP bound. However, it does state XB1 is RP/BW bound at that particular level.

Well, I'm super serious.

Seems like we are not looking at the same slides. What's the point of showing RGBA32F for PS4? Of course the result would again be BW bound.

Shortbread · Mar 30, 2014

Jcustom said:
Well, I'm super serious.
Seems like we are not looking at the same slides. What's the point of showing RGBA32F for PS4? Of course the result would again be BW bound.

But it also states that for XB1 ("RP/BW"), and it includes BW as well. I see it as plain as day from the screenshot.

So make me understand how one (XB1) is "RP/BW" bound at RGBA:16, yet able to do RGBA:32F within its given spec. However, PS4 cannot, given it's only "BW" bound at RGBA:16.

That's all I'm asking...

AlNom · Mar 30, 2014

Shortbread said:
So make me understand how one (XB1) is "RP/BW" bound at RGBA:16, yet able to do RGBA:32F within its given spec. However, PS4 cannot, given it's only "BW" bound at RGBA:16.

It's not a question of capability. It's just showing where each respective HW is clearly ROP bound or clearly bandwidth bound. That's it.

Shortbread · Mar 30, 2014

AlNets said:
It's not a question of capability. It's just showing where each respective HW is clearly ROP bound or clearly bandwidth bound. That's it.

I get that. I do understand that. My initial response was that PS4 is able to handle "32f"... it doesn't have to be stated. And Humus could have simply forgotten to mention it, or didn't care to.

KKRT · Mar 30, 2014

Shortbread said:
I get that. I do understand that. My initial response was that PS4 is able to handle "32f"... it doesn't have to be stated. And Humus could have simply forgotten to mention it, or didn't care to.

No one stated that PS4 does not support 128b buffers. And the slides were not about supporting things, but just showing bound conditions. If lower precision buffers are already BW bound, there is no need to talk about higher ones, because it's obvious.

Shortbread · Mar 30, 2014

KKRT said:
No one stated that PS4 does not support 128b buffers. And the slides were not about supporting things, but just showing bound conditions. If lower precision buffers are already BW bound, there is no need to talk about higher ones, because it's obvious.

No, it was implied by Function's statement. Which I answered. But thanks for the re-clarification of what I was trying to head off.

arhra · Mar 30, 2014

Shortbread said:
But it also states that for XB1 ("RP/BW"), and it includes BW as well. I see it as plain as day from the screenshot.

So make me understand how one (XB1) is "RP/BW" bound at RGBA:16, yet able to do RGBA:32F within its given spec. However, PS4 cannot, given it's only "BW" bound at RGBA:16.

That's all I'm asking...

The XB1 results are listing results separately for ESRAM and main RAM, where they differ. So RGBA8 is ROP-bound to either target, RGBA16F is ROP-bound to ESRAM, while being BW-bound to main memory, and finally RGBA32F is BW-bound to both.

The PS4 slide is simpler, because there's only one BW figure to worry about, and RGBA16F will saturate it, so it can automatically be assumed that any higher format is also BW-bound.
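The classification being argued about can be reproduced with some back-of-the-envelope arithmetic. This is only a sketch: the ROP counts, clocks and bandwidth figures are assumptions taken from commonly quoted specs (they are not given in this thread), the ESRAM figure is the one-direction number, and the 2% tolerance used to call something a "crossover" is arbitrary. Only RGBA8 and RGBA16F are covered, since both are assumed to export at full rate.

```python
# Assumed figures (not from this thread):
#   XB1: 16 ROPs at 853 MHz, ESRAM ~109 GB/s, DDR3 ~68 GB/s
#   PS4: 32 ROPs at 800 MHz, GDDR5 ~176 GB/s
PLATFORMS = {
    "XB1 (ESRAM)": {"rops": 16, "clock_mhz": 853, "bw_gbps": 109},
    "XB1 (DDR3)": {"rops": 16, "clock_mhz": 853, "bw_gbps": 68},
    "PS4 (GDDR5)": {"rops": 32, "clock_mhz": 800, "bw_gbps": 176},
}
BYTES_PER_PIXEL = {"RGBA8": 4, "RGBA16F": 8}


def rop_demand_gbps(rops, clock_mhz, fmt):
    """Write traffic in GB/s if every ROP exports one pixel per clock."""
    pixels_per_sec = rops * clock_mhz * 1e6
    return pixels_per_sec * BYTES_PER_PIXEL[fmt] / 1e9


def classify(rops, clock_mhz, fmt, bw_gbps, tolerance=0.02):
    """Return 'ROP', 'BW', or 'ROP/BW' when demand is within tolerance of BW."""
    ratio = rop_demand_gbps(rops, clock_mhz, fmt) / bw_gbps
    if abs(ratio - 1.0) <= tolerance:
        return "ROP/BW"
    return "ROP" if ratio < 1.0 else "BW"


for name, p in PLATFORMS.items():
    for fmt in BYTES_PER_PIXEL:
        demand = rop_demand_gbps(p["rops"], p["clock_mhz"], fmt)
        bound = classify(p["rops"], p["clock_mhz"], fmt, p["bw_gbps"])
        print(f"{name:12s} {fmt:8s} {demand:6.1f} GB/s vs {p['bw_gbps']} GB/s -> {bound}")
```

Under these assumed numbers, RGBA16F to ESRAM lands almost exactly on the crossover, while the PS4's single memory pool is already saturated at RGBA16F, which is why a wider format adds nothing new to its slide.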

sebbbi · Mar 30, 2014

Great article by Humus. Lots of good information there on how to get the most out of the GCN architecture.

RGBA32F is half rate export on GCN (half fill rate). There's a typo on the slides. Packing data to RGBA32 doesn't "improve" the fill rate (or bandwidth usage) over RGBA16.

Blending doubles the bandwidth usage (read-modify-write). RGBA16F with blending is bandwidth bound on all GCN cards (even 290X). Even if you don't sample any textures or read any vertex data.
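Both points can be put into numbers with a small sketch. The 290X figures used here (64 ROPs, 1000 MHz, 320 GB/s) are assumptions from the commonly quoted specs, not from this thread, and the model ignores compression, caches and any other memory traffic.

```python
BYTES_PER_PIXEL = {"RGBA16F": 8, "RGBA32F": 16}
# Export rate relative to full rate; RGBA32F is half rate on GCN, as noted above.
EXPORT_RATE = {"RGBA16F": 1.0, "RGBA32F": 0.5}


def rop_traffic_gbps(rops, clock_mhz, fmt, blending=False):
    """Render-target traffic in GB/s with the ROPs running flat out."""
    pixels_per_sec = rops * clock_mhz * 1e6 * EXPORT_RATE[fmt]
    traffic = pixels_per_sec * BYTES_PER_PIXEL[fmt]
    if blending:
        traffic *= 2  # read-modify-write: every pixel is read and then written
    return traffic / 1e9


def is_bandwidth_bound(rops, clock_mhz, fmt, bw_gbps, blending=False):
    """True if the ROPs could generate more traffic than memory can carry."""
    return rop_traffic_gbps(rops, clock_mhz, fmt, blending) > bw_gbps


# Assumed 290X figures: 64 ROPs, 1000 MHz, 320 GB/s.
R290X = {"rops": 64, "clock_mhz": 1000}
R290X_BW_GBPS = 320

rgba16f = rop_traffic_gbps(fmt="RGBA16F", **R290X)
rgba32f = rop_traffic_gbps(fmt="RGBA32F", **R290X)
blended = rop_traffic_gbps(fmt="RGBA16F", blending=True, **R290X)
print(f"RGBA16F: {rgba16f:.0f} GB/s, RGBA32F: {rgba32f:.0f} GB/s")
print(f"RGBA16F blended: {blended:.0f} GB/s vs {R290X_BW_GBPS} GB/s available")
```

In this model the half-rate RGBA32F export moves the same number of bytes per second as RGBA16F while writing half as many pixels, and blended RGBA16F asks for several times the assumed memory bandwidth before any textures or vertex data are fetched.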

function · Mar 30, 2014

Thowllly said:
No, interpolation between texels is still done by the texture units.

Good, thought I'd missed something big there!

Thowllly said:
I don't see what you mean here. The PS4 is ~41% faster than the XB1 if the filtering is done in the texture units, and ~41% faster than the XB1 if the filtering is done with the ALUs, so it makes no difference.

I was thinking in terms of compute taking up a higher proportion of ALU resources, and leaving proportionately less for [other stuff]. So you could (in theory) reduce ALU load to shift the bottleneck elsewhere. Though obviously this isn't happening due to texture interpolation.

In a game like Thief (no idea if it uses compute for much), which runs at a 50% higher resolution on PS4, the texture filtering is actually better on the Xbox 1. I was looking for a bottleneck in the Xbox 1 that would mean aniso had a smaller hit, but I guess it might not be any more complex than being ROP or BW bound...?

Shortbread said:
No, it was implied by Function's statement. Which I answered. But thanks for the re-clarification of what I was trying to head off.

I didn't mention or imply anything about supported PS4 buffer formats. You didn't answer or head off anything. You simply misunderstood the slides. Don't try and drag me into your mistake covering.
