Optimizing shaders

Ylisar

Newcomer
Hi,

I wonder how people generally go about optimizing their shaders. We're doing a deferred shading pipe on PC using HLSL / Cg. However, some shaders are slow and now I need to do some optimizations to hopefully get them up to speed. What is the most productive way to go about doing this? Are there any awesome tools which people normally use? Ideally I guess one would want to be able to tweak and reload the shader without a restart, and see the instructions / what's getting pipelined and so on. PIX doesn't seem to be the solution, however.
 
Since it appears you're developing on NVIDIA hardware, you want PerfHUD and FXComposer from here:

http://developer.nvidia.com/page/home.html

PerfHUD will let you do runtime profiling and debugging of your application and its shaders (start with the perf hints and tips in the documentation to get you going, then ask specific Qs here or on NV's PerfHUD forum if you need more). It should work quite nicely for deferred shading actually, with the frame scrubber. FXComposer will do offline performance profiling for shaders in all high level shading languages they support on the desktop. As for specific optimisations for specific shaders, that generally requires a bit of architecture knowledge and insight into what the compiler's capable of on your behalf. The tools won't show you the internal pipelining of the ALUs, but it will show you what it'll execute and retire per clock per shader.

PIX is nice for per-pixel history, but weak for most other things (for me anyway, I know other developers make good use of it in their dev pipelines). If you want specific shader optimisation help, there are a great number of people on these forums who either write shader compilers or have in-depth knowledge of runtime optimisation, so post snippets or full shaders and I'm sure we'll have a go.
 
Ah, great, I'll take a look at those right away. The largest culprit is the fragment program of our point lights. We use a deferred shading pipe and what I presume are the standard tricks. This is the code at the moment:

Code:
void deferredPointLight_fp(
                out float4 color : COLOR0,
                float2 screenPos : SV_POSITION,
                uniform sampler2D normalMap : register(s0),
                uniform sampler2D diffuseMap : register(s1),
                uniform sampler2D specularMap : register(s2),
                uniform float4x4 biasedInvViewProj,
                uniform float4 lightDiffuse,
                uniform float4 lightSpecular,
                uniform float4 lightAttenuation,
                uniform float4 cameraPos,
                uniform float3 lightPos,
                uniform float4 viewPortSize )
{
    float2 uv = screenPos.xy / viewPortSize.xy;
    float4 diffuse = tex2D( diffuseMap, uv );
    float4 surfaceNormal = tex2D( normalMap, uv );
    
    surfaceNormal.xyz = expand( surfaceNormal.xyz );
    float4 specular = tex2D( specularMap, uv );
    
    // Pixel depth is in the w channel of the normal map
    float4 pixelPos = float4( uv, surfaceNormal.w, 1.f );
    pixelPos = mul( biasedInvViewProj, pixelPos );
    pixelPos.xyz /= pixelPos.w;
    
    float3 lightVec = lightPos - pixelPos.xyz;
    float3 normLightVec = normalize( lightVec );
    float3 viewVec = normalize( cameraPos.xyz - pixelPos.xyz );
    float lightDistance = length( lightVec );

    float lightFactor = 1 / ( lightAttenuation.x + lightDistance * lightAttenuation.y + ( lightDistance * lightDistance ) * lightAttenuation.z );
    
    float diffusePower = saturate( dot( normLightVec, surfaceNormal.xyz ) );
    float3 halfAngle = ( viewVec + normLightVec ) / 2;
    
    // Expand specular exponent
    float specExp = specular.w * 1024;
    float specularPower = pow( saturate( dot( surfaceNormal.xyz, halfAngle ) ), specExp );
    
    color = float4( lightFactor * ( ( lightDiffuse.xyz * diffusePower * diffuse.xyz ) + ( lightSpecular.xyz * specularPower * specular.xyz ) ), 1.f );
}

Thanks for the help so far! :)

Any tips appreciated.
 
A nice easy one to start with: remove unnecessary divides. In particular, send up the reciprocal of your viewport size and replace the divide with a multiply (1st line of the shader code).
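Something like this (a sketch only; invViewPortSize is just a name for the new uniform holding 1/width, 1/height):

Code:
// Set once on the CPU: invViewPortSize = float4( 1.0f / width, 1.0f / height, 0.0f, 0.0f )
uniform float4 invViewPortSize;

// ...then the first line of the shader becomes a multiply instead of a divide:
float2 uv = screenPos.xy * invViewPortSize.xy;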

Deano
 
The best way is to generate asm code and walk through it. Trust me, compilers can create awful code out of nowhere. If you want an example: the last time I used asm "magic" was for 1 / sqrt(float4), which was compiled per component into rsq() rcp() rcp() instead of a simple rsq().
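To illustrate the kind of thing to look for (a sketch; whether the compiler actually emits the worse sequence depends on the compiler and version):

Code:
// Mathematically identical, but depending on the compiler the first line may
// expand into a per-component sequence instead of mapping straight to the
// hardware's reciprocal square root.
float4 a = 1.0f / sqrt( v );
float4 b = rsqrt( v );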
 
I've replaced the divide with a multiply as described by DeanoC.

I installed and took a look at FXComposer; it seems rather slick. However, it doesn't seem to run our shaders out of the box, so I'm guessing there's some rather lengthy porting to be done in order for FXComposer to be happy. PerfHUD was easier and works a treat. I tried to use the built-in editor in PerfHUD to improve iteration times, however it seems really subpar (it doesn't even recognize that I'm not using an English keyboard layout). It's acting up in other ways as well, but for profiling and data collection it's super.

I don't have a lot of experience with assembler sadly, though I do know the basics. Is it possible to inline asm in HLSL / Cg? I recall the PS3 having rather neat tools for seeing how asm got pipelined and how many cycles each instruction takes; is there such a thing readily available for HLSL / Cg, or does one need to calculate that manually?
 
You can't supply D3D10+ with asm shaders any more, and I'm in the camp that trusts the high level GPU compiler anyway. When it doesn't do something it should, I file a bug, rather than try and outwit it with asm. Those days are long gone for me anyway, since I'm a D3D person in the main. You can supply binary shaders to most OpenGL implementations though, if you have access to the vendor offline compiler.

There are vendor-specific tools that'll show you estimates of cycle counts, or in AMD's case see the raw ISA instructions from the high level compiler to get a direct indication of what the hardware is doing. Combine that with an understanding of the hardware architecture and you can start to figure out intuitively what the hardware should be doing for most lines of high level code.

As far as PerfHUD and FXComposer bugs go, definitely file them. The team there at NVIDIA is really rather responsive to user suggestions and bug reports if you provide lots of good information.

As for your specific shader, you'll probably burn more than a few clocks with the power expansion, but the rest should run reasonably quickly if you can optimise some more of the divides (things like expand() are 'free' ops for the most part). At a quick glance, there are a couple of places where you could help the compiler by providing consts instead of uniforms too (viewport size should be known up front).
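e.g. something along these lines (a sketch; only works if you're happy to recompile when the viewport changes, and 1280x720 is just a placeholder resolution):

Code:
// Baked in at compile time rather than passed as a uniform, so the compiler
// can fold the reciprocal into a literal constant.
static const float2 kInvViewPortSize = float2( 1.0f / 1280.0f, 1.0f / 720.0f );

float2 uv = screenPos.xy * kInvViewPortSize;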

Will have a proper look after work :smile:
 
I don't know if micro-optimizations will help you that much, but anyway:

You do both normalize() and length(), which compute the same thing. Just do the length once, then divide by it to normalize; it's an expensive operation (at least mathematically - dunno, maybe the hardware has a one-clock normalize, but I doubt it, since it involves a square root).
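Roughly like this, against the posted shader (a sketch; the compiler may already common this up for you):

Code:
float3 lightVec      = lightPos - pixelPos.xyz;
float  lightDistance = length( lightVec );          // one square root...
float3 normLightVec  = lightVec / lightDistance;    // ...reused instead of normalize()'s own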

I just wonder why your lightFactor is a funny vector-calculation:

[lightAttenuation.x] * [1]
[lightAttenuation.y] * [1 * lightDistance]
[lightAttenuation.z] * [1 * lightDistance * lightDistance]

Ah, it looks like you have a three-term polynomial in your vector. I suspect you could maybe replace that with a smart approximation - with exp, if it turns out faster; and if you only use positive values, using exp's inverse, log, will get rid even of the 1/x.
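For reference, that polynomial spelled out as a single dot product (a sketch; not necessarily faster than what the compiler already generates):

Code:
float d = lightDistance;
float lightFactor = 1.0f / dot( lightAttenuation.xyz, float3( 1.0f, d, d * d ) );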
 
The obvious things were already mentioned, but let me add this:

Don't use saturate when you mean to take the non-negative dot product of normalized vectors; use a trivial max(0, x) instead.
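i.e., in the posted shader:

Code:
float diffusePower = max( 0.0f, dot( normLightVec, surfaceNormal.xyz ) );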
 
Before we go too much further with this, have you run any profiling of your code to see where the bottleneck is? What's your target hardware? This is the first thing you should do before messing around with the shader code. For instance, on a modern GPU, you're almost certainly not going to be math bound even on this shader, so no need to micro-optimize that stuff. On an older GPU it may be worth it (and there are a few more things you can do on that front).

Most likely you're bandwidth bound on reading the G-buffer if you're doing many large lights. This would mean that you want to concentrate more on amortizing this cost by combining multiple lights, doing better compression on the G-buffer, doing better light culling, etc.

But this is all conjecture... you need to run a tool like PerfHUD or Intel's GPA to get an idea of whether you're bound on math, bandwidth or something else. Don't worry about optimizing the math unless you determine that's where you're actually bottlenecked (which is rare on modern GPUs).
 
You can't supply D3D10+ with asm shaders any more, and I'm in the camp that trusts the high level GPU compiler anyway. When it doesn't do something it should, I file a bug, rather than try and outwit it with asm. Those days are long gone for me anyway, since I'm a D3D person in the main. You can supply binary shaders to most OpenGL implementations though, if you have access to the vendor offline compiler.

I mean NOT creating shaders in asm, but looking at what is generated. You CAN create HLSL code that will compile to optimal asm, but you need to think, and evaluate the asm you get.

I recall the PS3 having rather neat tools for seeing how asm got pipelined and how many cycles each instruction takes, is there such a thing readily available for HLSL / CG or do one need to calculate that manually?
On PS3 it is a custom version of nvshaderperf. You can also use AMD ShaderAnalyzer.

And yes, I re-iterate: generate the asm, run a profiler on it and think. No matter what they tell you (no offense to anyone), you can be ALU bottlenecked even on a modern GPU - purely because of poorly generated code. Although in the case of your shader the bottleneck is most likely fill rate / texture rate.
 
With only 3 texture fetches and one output, this shader is extremely unlikely to be bandwidth bound. It looks ALU bound to me.

What is that expand function?

Jawed
 
With only 3 texture fetches and one output, this shader is extremely unlikely to be bandwidth bound. It looks ALU bound to me.

What is that expand function?

Jawed

2 * (register - 0.5), Cg only I think.
 
It's user defined, and actually exactly what Rys said.
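For reference, ours is simply (the exact signature is mine, but it's the standard 2 * (x - 0.5) expansion):

Code:
float3 expand( float3 v )
{
    return 2.0f * ( v - 0.5f );
}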

I'll do some more digging around and research asm, thanks for all the suggestions so far.
 
For what it's worth, the D3D assembly:

Code:
//
// Generated by Microsoft (R) HLSL Shader Compiler 9.26.952.2844
//
// Parameters:
//
//   float4x4 $biasedInvViewProj;
//   float4 $cameraPos;
//   sampler2D $diffuseMap;
//   float4 $lightAttenuation;
//   float4 $lightDiffuse;
//   float3 $lightPos;
//   float4 $lightSpecular;
//   sampler2D $normalMap;
//   sampler2D $specularMap;
//   float4 $viewPortSize;
//
//
// Registers:
//
//   Name               Reg   Size
//   ------------------ ----- ----
//   $biasedInvViewProj c0       4
//   $lightDiffuse      c4       1
//   $lightSpecular     c5       1
//   $lightAttenuation  c6       1
//   $cameraPos         c7       1
//   $lightPos          c8       1
//   $viewPortSize      c9       1
//   $normalMap         s0       1
//   $diffuseMap        s1       1
//   $specularMap       s2       1
//
    ps_3_0
    def c10, -0.5, 0.5, 1024, 1
    dcl vPos.xy
    dcl_2d s0
    dcl_2d s1
    dcl_2d s2
    rcp r0.x, c9.x
    rcp r0.y, c9.y
    mul r0.xy, r0, vPos
    mul r1, r0.y, c1
    mad r1, c0, r0.x, r1
    texld r2, r0, s0
    mad r1, c2, r2.w, r1
    add r2.xyz, r2, c10.x
    add r2.xyz, r2, r2
    add r1, r1, c3
    rcp r0.z, r1.w
    mad r3.xyz, r1, -r0.z, c7
    mad r1.xyz, r1, -r0.z, c8
    dp3 r0.z, r3, r3
    rsq r0.z, r0.z
    dp3 r0.w, r1, r1
    rsq r0.w, r0.w
    mul r1.xyz, r1, r0.w
    rcp r0.w, r0.w
    mad r3.xyz, r3, r0.z, r1
    dp3_sat r0.z, r1, r2
    mul r1.xyz, r0.z, c4
    mul r3.xyz, r3, c10.y
    dp3_sat r0.z, r2, r3
    texld r2, r0, s2
    texld r3, r0, s1
    mul r0.x, r2.w, c10.z
    pow r1.w, r0.z, r0.x
    mul r0.xyz, r1.w, c5
    mul r0.xyz, r2, r0
    mad r0.xyz, r1, r3, r0
    mad r1.x, r0.w, c6.y, c6.x
    mul r0.w, r0.w, r0.w
    mad r0.w, r0.w, c6.z, r1.x
    rcp r0.w, r0.w
    mul oC0.xyz, r0, r0.w
    mov oC0.w, c10.w
// approximately 39 instruction slots used (3 texture, 36 arithmetic)
On ATI you get this for HD4870:

Code:
; --------  Disassembly --------------------
00 ALU: ADDR(32) CNT(11) 
      0  y: ADD         ____,  R0.y, -0.5      
         z: ADD         T0.z,  R0.x, -0.5      
         w: MOV         R5.w,  1.0f      
         t: RCP_e       ____,  C9.y      
      1  y: MUL         R1.y,  PV0.y,  PS0      
         t: RCP_e       ____,  C9.x      
      2  x: MUL         R1.x,  T0.z,  PS1      
         y: MUL         R0.y,  PV1.y,  C1.w      
         z: MUL         R1.z,  PV1.y,  C1.x      
         w: MUL         R0.w,  PV1.y,  C1.y      
         t: MUL         R0.z,  PV1.y,  C1.z      
01 TEX: ADDR(128) CNT(3) VALID_PIX 
      3  SAMPLE R2, R1.xyxx, t0, s0
      4  SAMPLE R3, R1.xyxx, t2, s2
      5  SAMPLE R4.xyz_, R1.xyxx, t1, s1
02 ALU: ADDR(43) CNT(72) 
      6  x: MULADD      ____,  C0.x,  R1.x,  R1.z      
         y: MULADD      ____,  C0.w,  R1.x,  R0.y      
         z: MULADD      ____,  C0.z,  R1.x,  R0.z      VEC_201 
         w: MULADD      ____,  C0.y,  R1.x,  R0.w      
         t: ADD*2       T1.x,  R2.x, -0.5      
      7  x: MULADD      ____,  C2.x,  R2.w,  PV6.x      
         y: MULADD      ____,  C2.w,  R2.w,  PV6.y      
         z: MULADD      ____,  C2.z,  R2.w,  PV6.z      
         w: MULADD      ____,  C2.y,  R2.w,  PV6.w      
         t: ADD*2       T2.y,  R2.y, -0.5      
      8  x: ADD         T0.x,  PV7.x,  C3.x      
         y: ADD         ____,  PV7.y,  C3.w      
         z: ADD         T0.z,  PV7.z,  C3.z      
         w: ADD         T0.w,  PV7.w,  C3.y      
         t: ADD*2       T2.z,  R2.z, -0.5      
      9  x: MUL         T2.x,  R3.w,  C10.z      
         t: RCP_e       T1.w,  PV8.y      
     10  x: MULADD      T0.x,  T0.x, -PS9,  C8.x      
         y: MULADD      T0.y,  T0.w, -PS9,  C8.y      
         z: MULADD      T1.z,  T0.z, -PS9,  C8.z      
         w: MULADD      T2.w,  T0.x, -PS9,  C7.x      
     11  x: DOT4        ____,  PV10.x,  PV10.x      
         y: DOT4        ____,  PV10.y,  PV10.y      
         z: DOT4        ____,  PV10.z,  PV10.z      
         w: DOT4        ____,  (0x80000000, 0.0f).x,  0.0f      
         t: MULADD      T1.y,  T0.w, -T1.w,  C7.y      
     12  z: MULADD      T0.z,  T0.z, -T1.w,  C7.z      
         t: RSQ_e       T1.w,  |PV11.x|      
     13  x: DOT4        ____,  T2.w,  T2.w      
         y: DOT4        ____,  T1.y,  T1.y      
         z: DOT4        ____,  PV12.z,  PV12.z      
         w: DOT4        ____,  (0x80000000, 0.0f).x,  0.0f      
         t: MUL         T0.y,  T0.y,  PS12      
     14  x: MUL         T0.x,  T0.x,  T1.w      
         z: MUL         ____,  T1.z,  T1.w      
         t: RSQ_e       ____,  |PV13.x|      
     15  x: MULADD/2    ____,  T2.w,  PS14,  PV14.x      
         y: MULADD/2    ____,  T1.y,  PS14,  T0.y      
         z: MULADD/2    ____,  T0.z,  PS14,  PV14.z      
         w: MUL         ____,  T2.z,  PV14.z      VEC_120 
         t: RCP_e       T1.y,  T1.w      
     16  x: DOT4        ____,  T1.x,  PV15.x      CLAMP 
         y: DOT4        ____,  T2.y,  PV15.y      CLAMP 
         z: DOT4        ____,  T2.z,  PV15.z      CLAMP 
         w: DOT4        ____,  (0x80000000, 0.0f).x,  0.0f      CLAMP 
         t: MULADD      ____,  T2.y,  T0.y,  PV15.w      
     17  x: MULADD      ____,  T1.x,  T0.x,  PS16      CLAMP 
         z: MUL         ____,  T1.y,  T1.y      
         w: MULADD      ____,  T1.y,  C6.y,  C6.x      
         t: LOG_sat     ____,  PV16.x      
     18  x: MUL         T2.x,  PV17.x,  C4.y      
         y: MULADD      T1.y,  PV17.z,  C6.z,  PV17.w      
         z: MUL         ____,  T2.x,  PS17      
         w: MUL         T1.w,  PV17.x,  C4.x      
         t: MUL         T0.y,  PV17.x,  C4.z      
     19  t: EXP_e       ____,  PV18.z      
     20  x: MUL         ____,  PS19,  C5.x      
         z: MUL         ____,  PS19,  C5.z      
         w: MUL         ____,  PS19,  C5.y      
         t: RCP_e       T0.x,  T1.y      
     21  x: MUL         ____,  R3.y,  PV20.w      
         y: MUL         ____,  R3.x,  PV20.x      
         w: MUL         ____,  R3.z,  PV20.z      
     22  x: MULADD      ____,  T0.y,  R4.z,  PV21.w      
         y: MULADD      ____,  T2.x,  R4.y,  PV21.x      
         z: MULADD      ____,  T1.w,  R4.x,  PV21.y      
     23  x: MUL         R5.x,  PV22.z,  T0.x      
         y: MUL         R5.y,  PV22.y,  T0.x      
         z: MUL         R5.z,  PV22.x,  T0.x      
03 EXP_DONE: PIX0, R5
END_OF_PROGRAM

So, quite ALU bound.

Jawed
 
So, quite ALU bound.
Not necessarily... at least one of those texture fetches is going to be 64-bit wide (normal, assuming fx16 - note that you should compress this one better to 2 terms), and the other two are 32-bit (plus constants, which do have some effect). Not to mention, this shader will be iterated and overdrawn a whole pile of times on the same G-buffer requiring the R/Ws again each time (including additive blending). There's enough work going on outside of the shader to consider.
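By 2 terms I mean something like this on the read side (a sketch only, assuming view-space normals whose z faces the camera; many engines use a fancier encoding to avoid that assumption):

Code:
// Store only normal.xy in the G-buffer and rebuild z at lighting time.
float2 nxy = tex2D( normalMap, uv ).xy * 2.0f - 1.0f;
float3 surfaceNormal = float3( nxy, sqrt( saturate( 1.0f - dot( nxy, nxy ) ) ) );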

Even with an ALU ratio of ~12:1 it's still not necessarily ALU bound on modern GPUs which often require ratios upwards of 20:1 (!). In this case a lot of it is scalar dependent math so ATI's hardware may not run at full efficiency which will "help" somewhat.

You really need to run this through your actual data-paths rather than just look at the disassembly, which says nothing about the cost of those texture samples and how they interact with latency hiding, etc. It's a fairly simple case here, but even still GPUs typically exhibit highly non-linear optimization characteristics. Replace the math with something simple (NdotL or even just a constant term - but make sure you still use all the inputs so they don't get optimized out!) and see the change in speed. Then replace your sampling or G-buffer with something simple (1x1 texture may not work here... just simplify it to just sampling depth or something) and see how that affects the speed.
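e.g. something like this as the "math removed" variant (a sketch; the point is just that the fetches stay live so nothing gets optimised away):

Code:
// Degenerate version for profiling: keep the G-buffer fetches but collapse
// the lighting maths to something trivial. Strictly you'd want to reference
// every uniform too, so none of the inputs get optimised out.
float4 d = tex2D( diffuseMap, uv );
float4 n = tex2D( normalMap, uv );
float4 s = tex2D( specularMap, uv );
color = ( d + n + s ) * lightDiffuse;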

I think you'll be surprised how much math a modern GPU can hum through without missing a beat. If it does turn out to be ALU bound on your target hardware though, there are some things you can do. I'll save specific suggestions until you let me know what your target hardware is. Also it would help to know how you're doing your light culling, as depending on how efficient that is some trade-offs can be made in the lighting accumulation phase.
 
Uhh not necessarily... at least one of those texture fetches is going to be 64-bit wide (normal, assuming fx16 - note that you should compress this one better to 2 terms), and the other two are 32-bit (plus constants, which do have some effect).
If you look at the ATI assembly code you can clearly see that two TEX instructions return 128 bits and the third returns 96 bits.

These three fetches take 3 ideal cycles. The texture addressing, though dependent, accesses texels in a very cache-friendly way. The address for all three fetches is the same and is linearly dependent upon the pixel's 2D location.

Not to mention, this shader will be iterated and overdrawn a whole pile of times on the same G-buffer requiring the R/Ws again each time (including additive blending). There's enough work going on outside of the shader to consider.
If each iteration is ALU bound, running lots of iterations isn't going to change that.

Even with an ALU ratio of ~12:1
It's 7:1 in terms of ideal cycles for the assembly I posted. 4:1 is required to hide all the latency on this hardware for well-cached fetches (such as this shader).

it's still not necessarily ALU bound on modern GPUs which often require ratios upwards of 20:1. In this case a lot of it is scalar dependent math so ATI's hardware may not run at full efficiency which will "help" somewhat.
Did you see the ATI assembly code I posted :???: This shader as posted has 73% utilisation of the 5-way ALUs. There's a decent amount of instruction level parallelism.

You really need to run this through your actual data-paths rather than just look at the disassembly, which says nothing about the cost of those texture samples and how they interact with latency hiding, etc.
Agreed, it's not possible to simply count instructions and say that's the end of it. But two things are in my favour: the ratio is high and the texturing will be cached very efficiently.

If there were 12 or 14 ALU cycles and there were random texel coordinates for all 3 fetches and adjacent pixels had no coherence in their addressing, I'd hesitate to call it ALU-bound. I'd want a higher ALU:TEX, for sure...

On NVidia the hardware configuration is biased towards texturing in comparison with ATI, which simply makes it more likely to be ALU-bound. In the case of this shader, it's even less likely to be anything other than ALU bound.

Jawed
 
If you look at the ATI assembly code you can clearly see that two TEX instructions return 128 bits and the third returns 96 bits.
Right, but I'm talking about the width of the source data and thus the bandwidth requirements. Obviously they return 4x 32-bit float and 3x 32-bit float to the shader.

These three fetches take 3 ideal cycles. The texture addressing, though dependent, accesses texels in a very cache-friendly way. The address for all three fetches is the same and is linearly dependent upon the pixel's 2D location.
Sure, they are obviously very streaming-friendly accesses, but when you say "3 ideal cycles" you're assuming that it isn't bandwidth limited and perfectly hidden latency. It may well be, but it depends on the hardware (not to mention live register counts, etc). You can't simplify it that much... these instructions obviously don't have a latency of 3 cycles.

If each iteration is ALU bound, running lots of iterations isn't going to change that.
True, but having lots of these in flight at once affects the latency-hiding ability of the machine.

It's 7:1 in terms of ideal cycles for the assembly I posted. 4:1 is required to hide all the latency on this hardware for well-cached fetches (such as this shader).
Again, this still depends on live contexts. No way the latency of fetch is 4 cycles...

Did you see the ATI assembly code I posted :???: This shader as posted has 73% utilisation of the 5-way ALUs. There's a decent amount of instruction level parallelism.
Not bad, indeed, and better than I expected (although those DOT4's are not technically "parallel", despite the special-case in ATI's hardware).

Agreed, it's not possible to simply count instructions and say that's the end of it. But two things are in my favour: the ratio is high and the texturing will be cached very efficiently.
Yes, true, but this is also a pretty simple shader so I don't expect it to amortize the surrounding logic as much.

If there were 12 or 14 ALU cycles and there were random texel coordinates for all 3 fetches and adjacent pixels had no coherence in their addressing, I'd hesitate to call it ALU-bound. I'd want a higher ALU:TEX, for sure...
Right, and whether or not that's how it is in practice depends on the lights and culling in the surrounding deferred rendering engine. If there are a couple huge lights and good culling then the lookups will be well cached but lots of small lights will exhibit the pattern that you describe.

I'd still run the tests to be sure... it's easy to be wrong on theoretical analysis, no matter how reasonable it might seem. And regardless, I'd still *always* err on the side of more math and less memory.
 
As for optimisation of the shader, you can try computing multiple lights per invocation - quite a lot of the shader is doing stuff that's got nothing to do with the light. Either put more lights into constants, or create a set of textures with light properties in them and iterate over the lights (fetch from textures) with a single invocation of the shader.

e.g. 4 lights in constants will take only ~twice as long as a single one-light invocation, rather than the 4x you'd pay doing 4 separate invocations, one light at a time.
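The shape of it, roughly (a sketch; the names and fixed array size are mine, and in practice you'd pad unused slots with zero-intensity lights):

Code:
#define NUM_LIGHTS 4

uniform float3 lightPositions[NUM_LIGHTS];
uniform float4 lightColours[NUM_LIGHTS];

// ... G-buffer fetches and position reconstruction happen once, then:
float3 accum = 0;
for( int i = 0; i < NUM_LIGHTS; ++i )
{
    float3 lightVec = lightPositions[i] - pixelPos.xyz;
    float  dist     = length( lightVec );
    float  nDotL    = max( 0.0f, dot( lightVec / dist, surfaceNormal.xyz ) );
    accum += lightColours[i].xyz * ( nDotL / ( 1.0f + dist * dist ) );  // simplified attenuation
}
color = float4( accum * diffuse.xyz, 1.0f );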

I think NVidia hardware/driver particularly dislikes changing constants of this type, as it requires a complete shader re-compilation for each light.

If you're coding for D3D10 then constant buffers are a good way of getting lots of light data into your shader. CBs can take lots of data and they avoid shader re-compilation issues.

This recent thread should be useful:

http://forum.beyond3d.com/showthread.php?t=56557

Jawed
 
:oops:

This is what I get for not having written a deferred engine:

e.g. on ATI HD4890 the original shader has a theoretical throughput of 6476 megapixels per second, assuming it's ALU-bound.

Assuming 44 bytes of texels per pixel results in required bandwidth of 285GB/s at that fillrate. Assuming the texels fetched by the shader map 1:1 to pixels then caching isn't going to help in any meaningful way (since each texel is used just once).

So that's >2x available bandwidth (and not all of that can be used for texels). So the original shader looks to be heavily bandwidth bound with the worst-case of 44 bytes of data.

Assuming 11 bytes of texels per pixel results in 71GB/s, which shouldn't be any trouble, as the GPU has almost 125GB/s of bandwidth.

On NVidia, e.g. GTX285, I guess theoretical throughput is around 4300MP/s. Worst case with 44 bytes per pixel is then ~190GB/s, compared with 159GB/s available.

Jawed
 