ROP/Bandwidth Consumption & Low-Level GCN Optimizations ala Humus GDC 2014

Starx

179726-1396172332.jpg

a14170-1396172466.jpg


On the XB1, if we are rendering to ESRAM, 64-bit just about hits the crossover point between ROP-bound and bandwidth-bound. But even if we render to the relatively slow DDR3 memory, we will still be ROP-bound if the render target is a typical 32-bit texture.

http://www.humus.name/Articles/Persson_LowlevelShaderOptimization.pdf
 
Thanks, that's an awesome presentation. For someone who writes DSP assembly all day, it's really interesting to see what shader code looks like underneath.
 
Yes, that's a very interesting presentation (and includes good annotation too, not just the ppt slides).

The above tallies very well with sebbbi's comments about packing render targets into RGBA16 formats to make best use of the ROPs.

Could anyone clear something up for me? The presentation talks about no interpolation hardware in DX11 / GCN. Would this include texture interpolation as seen in bi/tri/aniso filtering?

If so we might have an explanation as to why the potentially ROP limited XB1 has more resources available for texture filtering in some games than the otherwise higher performing PS4 ...?
 
Could anyone clear something up for me? The presentation talks about no interpolation hardware in DX11 / GCN. Would this include texture interpolation as seen in bi/tri/aniso filtering?
No, interpolation between texels is still done by the texture units.


If so we might have an explanation as to why the potentially ROP limited XB1 has more resources available for texture filtering in some games than the otherwise higher performing PS4 ...?
I don't see what you mean here. The PS4 is ~41% faster than the XB1 whether the filtering is done in the texture units or with the ALUs, so it makes no difference.
 
If so we might have an explanation as to why the potentially ROP limited XB1 has more resources available for texture filtering in some games than the otherwise higher performing PS4 ...?

How did you come to that conclusion? Because the PS4 "RGBA32 filtering" is missing from the slide? Come on now... PS4 has way more math (throughput) on handling that as well.
 
Because it's already BW bound with rgba16f, no need to give other examples.
Nowhere in the PS4 slide does it say PS4 is ROP bound at RGBA16F. However, it does state XB1 is ROP/BW bound at that particular level.
 
Are you kidding or just trolling? Nowhere in the PS4 slide does it say PS4 is ROP bound at RGBA16F. However, it does state XB1 is ROP/BW bound at that particular level.

It clearly states RGBA16F is bandwidth bound. Any higher format will still be... bandwidth bound.
 
Are you kidding or just trolling? Nowhere in the PS4 slide does it say PS4 is ROP bound at RGBA16F. However, it does state XB1 is ROP/BW bound at that particular level.
Well, I'm super serious :p
Seems like we are not looking at the same slides. What would be the point of showing RGBA32F for PS4? Of course the result would again be BW bound.
 
Well, I'm super serious :p
Seems like we are not looking at the same slides. What would be the point of showing RGBA32F for PS4? Of course the result would again be BW bound.

But it states that for XB1 as well: "ROP/BW", which includes BW too. I see it plain as day in the screenshot.

So help me understand how one (XB1) is "ROP/BW" bound at RGBA16, yet able to do RGBA32F within its given spec, while PS4 cannot, given it's only "BW" bound at RGBA16.

That's all I'm asking...
 
So help me understand how one (XB1) is "ROP/BW" bound at RGBA16, yet able to do RGBA32F within its given spec, while PS4 cannot, given it's only "BW" bound at RGBA16.

It's not a question of capability. It's just showing where each respective HW is clearly ROP bound or clearly bandwidth bound. That's it.
 
It's not a question of capability. It's just showing where each respective HW is clearly ROP bound or clearly bandwidth bound. That's it.

I get that. I do understand that. My initial response was that PS4 is able to handle "32F"; it doesn't have to be stated. And Humus could have simply forgotten to mention it, or didn't care to.
 
I get that. I do understand that. My initial response was that PS4 is able to handle "32F"; it doesn't have to be stated. And Humus could have simply forgotten to mention it, or didn't care to.

No one stated that PS4 does not support 128-bit buffers. And the slides were not about supported formats, just about showing bound conditions. If lower-precision buffers are already BW bound, there is no need to talk about higher ones, because it's obvious.
 
No one stated that PS4 does not support 128-bit buffers. And the slides were not about supported formats, just about showing bound conditions. If lower-precision buffers are already BW bound, there is no need to talk about higher ones, because it's obvious.

No, it was implied by Function's statement. Which I answered. But thanks for the re-clarification of what I was trying to head off. ;)
 
But it states that for XB1 as well: "ROP/BW", which includes BW too. I see it plain as day in the screenshot.

So help me understand how one (XB1) is "ROP/BW" bound at RGBA16, yet able to do RGBA32F within its given spec, while PS4 cannot, given it's only "BW" bound at RGBA16.

That's all I'm asking...

The XB1 results are listing results separately for ESRAM and main RAM, where they differ. So RGBA8 is ROP-bound to either target, RGBA16F is ROP-bound to ESRAM, while being BW-bound to main memory, and finally RGBA32F is BW-bound to both.

The PS4 slide is simpler, because there's only one BW figure to worry about, and RGBA16F will saturate it, so it can automatically be assumed that any higher format is also BW-bound.
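
The reasoning above can be checked with some back-of-the-envelope arithmetic. Here's a rough sketch, not taken from the slides: the ROP counts, clocks and bandwidth figures are commonly quoted public numbers that I'm assuming (nobody in this thread gave them), and the function names and the 5% "crossover" tolerance are made up for illustration.

```python
# Rough fill-rate vs. bandwidth estimate. All hardware figures below are
# ASSUMED from commonly quoted public specs, not from the thread or slides.
HARDWARE = {
    # name: (ROP count, GPU clock in GHz, write bandwidth in GB/s)
    "XB1 ESRAM": (16, 0.853, 109.0),
    "XB1 DDR3": (16, 0.853, 68.0),
    "PS4 GDDR5": (32, 0.800, 176.0),
}
BYTES_PER_PIXEL = {"RGBA8": 4, "RGBA16F": 8}


def rop_write_demand_gbs(rops, clock_ghz, bytes_per_pixel):
    """GB/s the ROPs would write at peak rate (one pixel per ROP per clock)."""
    return rops * clock_ghz * bytes_per_pixel


def classify(target, fmt, tolerance=0.05):
    """Return "ROP", "BW", or "ROP/BW" when demand is within tolerance of BW."""
    rops, clock_ghz, bw_gbs = HARDWARE[target]
    ratio = rop_write_demand_gbs(rops, clock_ghz, BYTES_PER_PIXEL[fmt]) / bw_gbs
    if abs(ratio - 1.0) <= tolerance:
        return "ROP/BW"  # hypothetical label for the crossover region
    return "ROP" if ratio < 1.0 else "BW"


if __name__ == "__main__":
    for target in HARDWARE:
        for fmt in BYTES_PER_PIXEL:
            print(f"{target:10s} {fmt:8s} -> {classify(target, fmt)} bound")
```

With these assumed numbers, RGBA16F to ESRAM lands almost exactly on the crossover, and RGBA16F on PS4 already exceeds the single memory bandwidth figure, which is why anything wider doesn't need its own line.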
 
Great article by Humus. Lots of good information there how to get most out of GCN architecture.

RGBA32F is half rate export on GCN (half fill rate). There's a typo on the slides. Packing data to RGBA32 doesn't "improve" the fill rate (or bandwidth usage) over RGBA16.

Blending doubles the bandwidth usage (read-modify-write). RGBA16F with blending is bandwidth bound on all GCN cards (even 290X). Even if you don't sample any textures or read any vertex data.
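
To put numbers on the two points above, here's a minimal sketch under assumed figures: 64 ROPs at 1.0 GHz and 320 GB/s for a 290X are public spec numbers I'm assuming, not something stated in this post, and export_demand_gbs is a hypothetical helper. It models blending as doubling the traffic per pixel (read plus write) and half-rate export as a 0.5 multiplier on the pixel rate.

```python
# ASSUMED public figures for a 290X: (ROP count, clock in GHz, bandwidth in GB/s).
R9_290X = (64, 1.0, 320.0)


def export_demand_gbs(rops, clock_ghz, bytes_per_pixel,
                      export_rate=1.0, blending=False):
    """Peak render-target traffic in GB/s.

    export_rate: 1.0 for full-rate formats, 0.5 for half-rate export.
    blending: read-modify-write, so each pixel is both read and written.
    """
    traffic_per_pixel = bytes_per_pixel * (2 if blending else 1)
    return rops * clock_ghz * export_rate * traffic_per_pixel


def is_bandwidth_bound(demand_gbs, bw_gbs):
    """True when the ROPs could move more data than memory can deliver."""
    return demand_gbs > bw_gbs


if __name__ == "__main__":
    rops, clock, bw = R9_290X
    rgba16f = export_demand_gbs(rops, clock, 8)
    rgba32f = export_demand_gbs(rops, clock, 16, export_rate=0.5)
    blended = export_demand_gbs(rops, clock, 8, blending=True)
    print(f"RGBA16F: {rgba16f:.0f} GB/s, RGBA32F (half rate): {rgba32f:.0f} GB/s")
    print(f"RGBA16F blended: {blended:.0f} GB/s vs {bw:.0f} GB/s available")
```

At half rate, RGBA32F moves the same number of bytes per clock as RGBA16F, so packing into the wider format gains nothing on fill rate or bandwidth.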
 
No, interpolation between texels is still done by the texture units.

Good, thought I'd missed something big there!

I don't see what you mean here. The PS4 is ~41% faster than the XB1 whether the filtering is done in the texture units or with the ALUs, so it makes no difference.

I was thinking in terms of compute taking up a higher proportion of ALU resources, and leaving proportionately less for [other stuff]. So you could (in theory) reduce ALU load to shift the bottleneck elsewhere. Though obviously this isn't happening due to texture interpolation.

In a game like Thief (no idea if it uses compute for much), which runs at a 50% higher resolution on PS4, the texture filtering is actually better on the Xbox 1. I was looking for a bottleneck in the Xbox 1 that would mean aniso had a smaller hit, but I guess it might not be any more complex than being ROP or BW bound...?

No, it was implied by Function's statement. Which I answered. But thanks for the re-clarification of what I was trying to head off. ;)

I didn't mention or imply anything about supported PS4 buffer formats. You didn't answer or head off anything. You simply misunderstood the slides. Don't try and drag me into your mistake covering.
 