R5xx and G8x - Threading and Texture Latency Hiding

Jawed

I've been looking more closely at the way the Far Cry 4-light shader compiles on R5xx, and think I've found a critical difference in the way R5xx hides texturing latency versus G8x.

The mechanism R5xx uses to identify thread-switch points is a pair of keywords, "sem_wait" and "sem_grab". These are the direct equivalents of the "TEX_SEM_WAIT" and "TEX_SEM_ACQUIRE" flags in CTM programming (section 3.4.2 of the CTM guide).

It turns out that R5xx has one semaphore for texture fetches. This contrasts with G8x which appears to support multiple semaphores - though I haven't found any CUDA or PTX documentation that describes this in detail, only slide 32 of Shebanow's presentation:

[Image: b3da001.jpg - slide 32 of Shebanow's presentation]

T2's latency is hidden until B3. The diagrammed instruction issues are described with the coding: B4[28] T1 B2[15] T2 B1[4] G1 B9[36] G2 B3[16] where:
  • Biii[ccc] - B is a block of ALU instructions, where iii is the instruction count and ccc is clocks taken to issue those instructions
  • Tnnn - is a single texture instruction, where nnn is an ordinal for texture instructions
  • Gnnn - is a gate (semaphore) for the nnn-th texture instruction
The implication is that there are two independent semaphores.
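To make that reading of the slide concrete, here's a minimal Python sketch that walks the quoted issue sequence and adds up the ALU issue clocks that fall between each Tn and its gate Gn. The function name and the token regex are my own for illustration, and it assumes the T and G tokens themselves cost no issue clocks, which the slide doesn't say either way.

```python
import re

# Issue sequence exactly as quoted from the slide.
SEQUENCE = "B4[28] T1 B2[15] T2 B1[4] G1 B9[36] G2 B3[16]"

# Biii[ccc] = ALU block, Tnnn = texture instruction, Gnnn = gate for texture nnn.
TOKEN = re.compile(r"B(\d+)\[(\d+)\]|T(\d+)|G(\d+)")


def alu_clocks_covering_each_fetch(sequence):
    """Return {texture ordinal: ALU issue clocks between Tn and its gate Gn}."""
    in_flight = {}  # texture ordinal -> ALU clocks issued since the fetch
    covered = {}
    for match in TOKEN.finditer(sequence):
        _count, clocks, tex, gate = match.groups()
        if clocks is not None:
            # An ALU block: every outstanding fetch gains this much cover.
            for ordinal in in_flight:
                in_flight[ordinal] += int(clocks)
        elif tex is not None:
            in_flight[int(tex)] = 0
        else:
            # A gate: the named fetch must have returned by this point.
            covered[int(gate)] = in_flight.pop(int(gate))
    return covered


print(alu_clocks_covering_each_fetch(SEQUENCE))
```

On that assumption T1 gets 19 clocks of ALU cover and T2 gets 40, because T2 stays in flight across G1. That is only possible if the two gates are tracked independently.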

Jawed
 
I just want to work through the Far Cry shader on R5xx, using ATI's GPU Shader Analyzer:

Here's the Cg:
Code:
#define _HDR_FAKE 
#define _NUM_LIGHTS 4 
#define _LIGHT0_TYPE 1 
#define _LIGHT0_ONLYSPEC 0 
#define _LIGHT0_SPECOCCLUSION 0 
#define _LIGHT1_TYPE 1 
#define _LIGHT1_ONLYSPEC 0 
#define _LIGHT1_SPECOCCLUSION 0 
#define _LIGHT2_TYPE 1 
#define _LIGHT2_ONLYSPEC 0 
#define _LIGHT2_SPECOCCLUSION 0 
#define _LIGHT3_TYPE 2 
#define _LIGHT3_ONLYSPEC 0 
#define _LIGHT3_SPECOCCLUSION 0 
int CGC = 0; // through a compiler parameter 
 
        struct vertout 
        { 
          float4 baseTC     : TEXCOORD0; 
          float4 bumpTC     : TEXCOORD1; 
          float4 viewVec    : COLOR0; 
    #ifdef _LIGHT0_TYPE 
          half4  lightVec[ _NUM_LIGHTS ] : TEXCOORD2; 
          float4 projTC      : COLOR1;            
    #endif 
        }; 
 
        struct pixout                                          
        {                                                      
          float4 Color  : COLOR;                              
        };                                                    
 
 
      half4  EXPAND( half4  a ) 
      { 
        half4  result; 
        if (CGC) 
         result = 2.h*(a - 0.5h); 
        else 
          result = a*2.h - 1.h; 
        return result; 
      } 
      half3  EXPAND( half3  a ) 
      { 
        half3  result; 
        if (CGC) 
          result = 2.h*(a - 0.5h); 
        else 
          result = a*2.h - 1.h; 
        return result; 
      } 
      half  EXPAND( half  a ) 
      { 
        half  result; 
        if (CGC) 
          result = 2.h*(a - 0.5h); 
        else 
          result = a*2.h - 1.h; 
        return result; 
      } 
 
      half3  HDREncode( half3  a ) 
      { 
        half3  result; 
#if defined(_HDR) && defined(_HDR_FAKE) 
        result = a * 8 ; 
#else 
        result = a * 2; 
#endif            
        return result; 
      } 
      half3  HDREncodeLM( half3  a ) 
      { 
        half3  result; 
#if defined(_HDR) && defined(_HDR_FAKE) 
        result = a * 8  * 2; 
#else 
        result = a * 4; 
#endif            
        return result; 
      } 
      half3  HDREncodeAmb( half3  a ) 
      { 
        half3  result; 
#if defined(_HDR) && defined(_HDR_FAKE) 
        result = a * 8  / 2; 
#else 
        result = a; 
#endif            
        return result; 
      } 
 
      half3  HDRFogBlend( half3  vColor, half  fFog, half3  vFogColor ) 
      { 
        half3  result; 
#if defined(_HDR) && defined(_HDR_FAKE) 
        result = lerp(HDREncodeAmb(vFogColor), vColor, fFog); 
#else 
        result = vColor; 
#endif          
        return result; 
      } 
 
      half4  HDRFogBlend( half4  vColor, half  fFog, half3  vFogColor ) 
      { 
        half4  result; 
#if defined(_HDR) && defined(_HDR_FAKE) 
        result.xyz = lerp(HDREncodeAmb(vFogColor), vColor, fFog); 
        result.a = vColor.a; 
#else 
        result = vColor; 
#endif          
        return result; 
      } 
 
      half3  GetNormalMap( sampler2D bumpMap, float2 bumpTC ) 
      { 
        half3  bumpNormal; 
#ifdef _PACKED_NORMALMAP 
        bumpNormal.xy = tex2D(bumpMap, bumpTC.xy).xy; 
        bumpNormal.z = sqrt(1 - (bumpNormal.x*bumpNormal.x + bumpNormal.y*bumpNormal.y)); 
#else 
        bumpNormal = EXPAND(tex2D(bumpMap, bumpTC.xy).xyz); 
#endif 
        return bumpNormal; 
      } 
    float4 main(vertout IN,  
        uniform sampler2D baseMap : register(s0),                
        uniform sampler2D bumpMap : register(s1),              
    #ifdef _LIGHT0_TYPE 
        uniform float4 Diffuse[ _NUM_LIGHTS ] : register(c2), 
    #endif 
    #ifdef _LIGHT0_TYPE 
        uniform float4 Specular[ _NUM_LIGHTS ] : register(c6), 
        uniform float4 SpecularExp : register(c14), 
    #endif 
        uniform float4 Ambient : register(c0), 
    #ifdef _LIGHT0_TYPE 
        uniform sampler2D attenMap : register(s2), 
        uniform samplerCUBE projMap : register(s3), 
    #endif 
    #ifdef _HDRLM 
        uniform sampler2D lightMapHDR : register(s4), 
    #endif      
        uniform float4 AttenInfo : register(c1)  
      , uniform float3 GlobalFogColor : register(c31)) : COLOR 
{ 
  float4 _finColor = (float4)0; 
        float2 baseTC = IN.baseTC.xy; 
        float2 bumpTC = IN.bumpTC.xy; 
        half3  viewVec = normalize(IN.viewVec.xyz); 
        half4  decalColor = tex2D(baseMap, baseTC.xy); 
        half  fAlpha = decalColor.a; 
        half3  bumpNormal = GetNormalMap(bumpMap, bumpTC.xy); 
    #ifndef _PACKED_NORMALMAP 
        bumpNormal.xyz = normalize(bumpNormal.xyz); 
    #endif 
        half3  vFinalDif = half3  (0,0,0); 
        half3  vFinalSpec = half3  (0,0,0);  
        half  NdotL; 
    #ifdef _LIGHT0_TYPE 
        int aLType[ _NUM_LIGHTS ]; 
        aLType[0] = _LIGHT0_TYPE; 
      #ifdef _LIGHT1_TYPE 
        aLType[1] = _LIGHT1_TYPE; 
      #endif 
      #ifdef _LIGHT2_TYPE 
        aLType[2] = _LIGHT2_TYPE; 
      #endif 
      #ifdef _LIGHT3_TYPE 
        aLType[3] = _LIGHT3_TYPE; 
      #endif                                                                            
        bool aBOnlySpec[ _NUM_LIGHTS ]; 
      #ifdef _LIGHT0_ONLYSPEC 
        aBOnlySpec[0] = _LIGHT0_ONLYSPEC; 
      #endif 
      #ifdef _LIGHT1_ONLYSPEC 
        aBOnlySpec[1] = _LIGHT1_ONLYSPEC; 
      #endif 
      #ifdef _LIGHT2_ONLYSPEC 
        aBOnlySpec[2] = _LIGHT2_ONLYSPEC; 
      #endif 
      #ifdef _LIGHT3_ONLYSPEC 
        aBOnlySpec[3] = _LIGHT3_ONLYSPEC; 
      #endif 
        bool aBSpecOcclusion[ _NUM_LIGHTS ]; 
      #ifdef _LIGHT0_SPECOCCLUSION 
        aBSpecOcclusion[0] = _LIGHT0_SPECOCCLUSION; 
      #endif 
      #ifdef _LIGHT1_SPECOCCLUSION 
        aBSpecOcclusion[1] = _LIGHT1_SPECOCCLUSION; 
      #endif 
      #ifdef _LIGHT2_SPECOCCLUSION 
        aBSpecOcclusion[2] = _LIGHT2_SPECOCCLUSION; 
      #endif 
      #ifdef _LIGHT3_SPECOCCLUSION 
        aBSpecOcclusion[3] = _LIGHT3_SPECOCCLUSION; 
      #endif 
        half  fAttenFunction = 4.f/16.f; 
        for (int i=0; i<_NUM_LIGHTS; i++) 
        { 
 
          half  atten = 1; 
          if (aLType[i] != 0) 
          { 
            half  dist = length(IN.lightVec[i].xyz) * AttenInfo[i]; 
            atten = tex2D(attenMap, float2(dist, fAttenFunction)); 
          } 
 
 
          half3  filterColor = half3  (1,1,1); 
          if (aLType[i] == 2) 
            filterColor = texCUBE(projMap, IN.projTC.xyz); 
 
          half3  lVec = normalize(IN.lightVec[i].xyz); 
 
 
          if (!aBOnlySpec[i]) 
          { 
 
            NdotL = saturate(dot(lVec.xyz, bumpNormal.xyz)); 
 
            half3  dif = decalColor.xyz * NdotL * ( half3  )Diffuse[i].xyz * atten * filterColor.xyz; 
            dif.xyz = HDREncode(dif.xyz);    
            vFinalDif += dif.xyz; 
          } 
 
          half3  reflVec = (2*dot(lVec.xyz, bumpNormal.xyz)*bumpNormal.xyz)-lVec.xyz; 
          half  NdotR = saturate(dot(reflVec.xyz, viewVec.xyz)); 
          half  fPow = SpecularExp.x; 
 
          half  specVal = pow(NdotR, fPow); 
          half3  spec = specVal * ( half3  )Specular[i].xyz; 
 
          spec = spec * atten * filterColor.xyz; 
 
          spec.xyz *= fAlpha; 
 
          spec.xyz = HDREncode(spec.xyz);    
          vFinalSpec += spec.xyz; 
 
        } 
    #endif 
 
        half3  amb = decalColor.xyz * ( half3  )Ambient.xyz; 
        half3  difLM = float3(0,0,0); 
        half3  env = float3(0,0,0); 
 
        amb = HDREncodeAmb(amb); 
        env = HDREncodeAmb(env); 
 
        half3  finalColor = amb + env + difLM + vFinalDif + vFinalSpec; 
        half  _fFogFrac = IN.viewVec.w; 
#ifdef _HDR 
        _finColor.xyz = lerp(HDREncodeAmb(( half3  )GlobalFogColor.xyz), finalColor.xyz, _fFogFrac); 
#else 
        _finColor.xyz = lerp(( half3  )GlobalFogColor.xyz, saturate(finalColor.xyz), saturate(_fFogFrac)); 
#endif          
 
        _finColor.w = Ambient.w; 
 
      return _finColor; 
}

and here's the resulting hardware "assembly":

Code:
 Shader stats:
     RS Instructions:         8
     TEX Instructions:        7
     ALU Instructions:       52
     ALU Instruction slots:  52
     CF Instructions:         0
     Pix Size:               12
     Highest Const:          31
     Start Addr:              0
     End Addr:               58
 
 RS Instructions:
 
   rs 00:                            r00.rg-- = txc00
   rs 01:                            r01.rg-- = txc01
   rs 02:                            r02.rgba = txc02 adjusted
   rs 03:                            r03.rgb- = txc03
   rs 04:                            r04.rgb- = txc04
   rs 05:                            r05.rgb- = txc05
   rs 06:                            r06.rgb- = txc06
   rs 07:                            r07.rgb- = txc07 adjusted
 
 US Program:
 
  0 tex 00    :  r01.rgb_ = lookup(r01.rgrr, tex01) ign_unc
  1 tex 01    :  r00.rgba = lookup(r00.rgrr, tex00) ign_unc
  2 tex 02    :  r07.rgb_ = lookup(r07.rgbr, tex03) sem_wait sem_grab ign_unc
  3 alu 00 rgb:             r08.--b = dp3(r04.rgb, r04.rgb)
         alpha:             r01.a   = clamped mad(r02.a, 1.0, 0.0)  
  4 alu 01 rgb:             r08.--b = dp3(r03.rgb, r03.rgb)
         alpha:             r02.a   = rsq(abs(r08.b))  
  5 alu 02 rgb:             r08.r-- = dp3(r05.rgb, r05.rgb)
         alpha:             r03.a   = rsq(abs(r08.b))  
  6 alu 03 rgb:             r08.-g- = dp3(r06.rgb, r06.rgb)
         alpha:             r04.a   = rsq(abs(r08.r))  
  7 alu 04 rgb:             r08.rgb = mad(r01.rgb, 1.0, c11.rrr)*2 sem_wait
         alpha:             r05.a   = rsq(abs(r08.g))  
  8 alu 05 pre:  srcp.rgb = 1.0-2.0*r01.rgb
  8 alu 05 rgb:             r01.rgb = cmp(neg(srcp.rgb), r08.rgb, nab(c10.rrr))
         alpha:             r06.a   = rcp(r02.a)  
  9 alu 06 rgb:             r08.rg- = mad(r06.a0.0r, c01.g0.00.0, (+2.5000000E-01).0.0ar)
         alpha:             r06.a   = rcp(r03.a)  
  10 alu 07 rgb:             r09.rg- = mad(r06.a0.0r, c01.r0.00.0, (+2.5000000E-01).0.0ar)
         alpha:             r06.a   = rcp(r04.a)  
  11 alu 08 rgb:             r10.rg- = mad(r06.a0.0r, c01.b0.00.0, (+2.5000000E-01).0.0ar)
         alpha:             r06.a   = rcp(r05.a)  
  12 alu 09 rgb:             r11.rg- = mad(r06.a0.0r, c01.a0.0r, (+2.5000000E-01).0.0ar)
         alpha:             r06.a   = mad(r01.b, r01.b, 0.0)  
  13 tex 03    :  r08.r___ = lookup(r08.rgrr, tex02) alu_wait ign_unc
  14 tex 04    :  r09.r___ = lookup(r09.rgrr, tex02) ign_unc
  15 tex 05    :  r10.r___ = lookup(r10.rgrr, tex02) ign_unc
  16 tex 06    :  r11.rrr_ = lookup(r11.rgrr, tex02) sem_wait sem_grab ign_unc
  17 alu 10 rgb:             r12.r-- = d2a(r01.rg0.0, r01.rg0.0, r06.rra)
         alpha:             r06.a   = mad(r02.b, r02.b, 0.0)  
  18 alu 11 rgb:             r04.rgb = mad(r04.rgb, r02.aaa, 0.0)
         alpha:             r02.a   = rsq(r12.r)  
  19 alu 12 rgb:             r01.rgb = mad(r01.rgb, r02.aaa, 0.0)
         alpha:             r02.a   = mad(r02.g, r02.g, r06.a)  
  20 alu 13 rgb:             r08.-g- = dp3(r04.rgb, r01.rgb)*2
         alpha:             r02.a   = mad(r02.r, r02.r, r02.a)  
  21 alu 14 rgb:             r03.rgb = mad(r03.rgb, r03.aaa, 0.0)
         alpha:             r02.a   = rsq(r02.a)  
  22 alu 15 rgb:             r04.rgb = mad(r01.rgb, r08.ggg, neg(r04.rgb))
         alpha:             r03.a   = clamped mad(r08.g, 1.0, 0.0)/2 
  23 alu 16 rgb:             r02.rgb = mad(r02.rgb, r02.aaa, 0.0) sem_wait
         alpha:             r02.a   = mad(r08.r, r03.a, 0.0)*2 
  24 alu 17 rgb:             r08.-g- = dp3(r01.rgb, r03.rgb)*2
         alpha:                       mad(0.0, 0.0, 0.0)  
  25 alu 18 rgb:             r05.rgb = mad(r05.rgb, r04.aaa, 0.0)
         alpha:             r03.a   = clamped mad(r08.g, 1.0, 0.0)/2 
  26 alu 19 rgb:             r04.r-- = clamped dp3(r04.rgb, r02.rgb)
         alpha:                       mad(0.0, 0.0, 0.0)  
  27 alu 20 rgb:             r03.rgb = mad(r01.rgb, r08.ggg, neg(r03.rgb))
         alpha:                       mad(0.0, 0.0, 0.0)  
  28 alu 21 rgb:             r08.--b = dp3(r01.rgb, r05.rgb)*2
         alpha:             r06.a   = ln2(r04.r)  
  29 alu 22 rgb:             r04.rgb = mad(r06.rgb, r05.aaa, 0.0)
         alpha:             r04.a   = mad(r06.a, c14.r, 0.0)  
  30 alu 23 rgb:             r03.--b = clamped dp3(r02.rgb, r03.rgb)
         alpha:             r04.a   = ex2(r04.a)*2 
  31 alu 24 rgb:             r05.rgb = mad(r01.rgb, r08.bbb, neg(r05.rgb))
         alpha:             r04.a   = mad(r08.r, r04.a, 0.0)  
  32 alu 25 rgb:             r03.r-- = dp3(r01.rgb, r04.rgb)*2
         alpha:             r05.a   = ln2(r03.b)  
  33 alu 26 rgb:             r03.--b = clamped dp3(r02.rgb, r05.rgb)
         alpha:             r07.a   = mad(r05.a, c14.r, 0.0)  
  34 alu 27 rgb:             r01.rgb = mad(r01.rgb, r03.rrr, neg(r04.rgb))
         alpha:             r11.a   = ln2(r03.b)  
  35 alu 28 rgb:             r03.--b = clamped dp3(r02.rgb, r01.rgb)
         alpha:             r07.a   = ex2(r07.a)  
  36 alu 29 rgb:             r01.rgb = mad(r00.rgb, c03.rgb, 0.0)
         alpha:             r11.a   = mad(r11.a, c14.r, 0.0)  
  37 alu 30 rgb:             r02.rgb = mad(r00.rgb, r03.aaa, 0.0)
         alpha:             r03.a   = clamped mad(r08.b, 1.0, 0.0)/2 
  38 alu 31 rgb:             r01.rgb = mad(r02.aaa, r01.rgb, 0.0)
         alpha:             r02.a   = ln2(r03.b)  
  39 alu 32 rgb:             r02.rgb = mad(r02.rgb, c02.rgb, 0.0)*2
         alpha:             r09.a   = mad(r04.a, r00.a, 0.0)  
  40 alu 33 rgb:             r04.rgb = mad(r00.rgb, r03.aaa, 0.0)
         alpha:             r04.a   = mad(r09.r, r07.a, 0.0)  
  41 alu 34 rgb:             r05.rgb = mad(c07.rgb, r09.aaa, 0.0)
         alpha:             r05.a   = ex2(r11.a)  
  42 alu 35 rgb:             r06.rgb = mad(c06.rgb, r04.aaa, 0.0)*2
         alpha:             r02.a   = mad(r02.a, c14.r, 0.0)  
  43 alu 36 rgb:             r01.rgb = mad(r09.rrr, r02.rgb, r01.rgb)
         alpha:             r04.a   = ex2(r02.a)  
  44 alu 37 rgb:             r04.rgb = mad(r04.rgb, c04.rgb, 0.0)*2
         alpha:             r06.a   = clamped mad(r03.r, 1.0, 0.0)/2 
  45 alu 38 rgb:             r02.rgb = mad(r00.rgb, r06.aaa, 0.0)
         alpha:             r02.a   = mad(r10.r, r05.a, 0.0)  
  46 alu 39 rgb:             r03.rgb = mad(r00.aaa, r06.rgb, r05.rgb)
         alpha:                       mad(0.0, 0.0, 0.0)  
  47 alu 40 rgb:             r05.rgb = mad(c08.rgb, r02.aaa, 0.0)*2
         alpha:                       mad(0.0, 0.0, 0.0)  
  48 alu 41 rgb:             r06.rgb = mad(r04.aaa, c09.rgb, 0.0)
         alpha:                       mad(0.0, 0.0, 0.0)  
  49 alu 42 rgb:             r07.rgb = mad(r11.rgb, r07.rgb, 0.0)
         alpha:                       mad(0.0, 0.0, 0.0)  
  50 alu 43 rgb:             r02.rgb = mad(r02.rgb, c05.rgb, 0.0)*2
         alpha:                       mad(0.0, 0.0, 0.0)  
  51 alu 44 rgb:             r01.rgb = mad(r10.rrr, r04.rgb, r01.rgb)
         alpha:                       mad(0.0, 0.0, 0.0)  
  52 alu 45 rgb:             r03.rgb = mad(r00.aaa, r05.rgb, r03.rgb)
         alpha:                       mad(0.0, 0.0, 0.0)  
  53 alu 46 rgb:             r04.rgb = mad(r07.rgb, r06.rgb, 0.0)*2
         alpha:                       mad(0.0, 0.0, 0.0)  
  54 alu 47 rgb:             r01.rgb = mad(r02.rgb, r07.rgb, r01.rgb)
         alpha:                       mad(0.0, 0.0, 0.0)  
  55 alu 48 rgb:             r02.rgb = mad(r00.aaa, r04.rgb, r03.rgb)
         alpha:                       mad(0.0, 0.0, 0.0)  
  56 alu 49 rgb:             r01.rgb = mad(r00.rgb, c00.rgb, r01.rgb)
         alpha:                       mad(0.0, 0.0, 0.0)  
  57 alu 50 rgb:             r01.rgb = clamped mad(r02.rgb, 1.0, r01.rgb)
         alpha:                       mad(0.0, 0.0, 0.0)  
   alu 50 post-NOP
  58 alu 51 pre:  srcp.rgb = r01.rgb-c31.rgb
  58 alu 51 rgb:  out0.rgb =           mad(srcp.rgb, r01.aaa, c31.rgb)
         alpha:  out0.a   =           mad(c00.a, 1.0, 0.0)  last

The first 3 TEX instructions are issued as a group, under a common semaphore. This semaphore synchronises at instruction slot 7. The key thing is that all three TEX instructions need to finish by then, despite the fact that TEX 01 isn't required until instruction 36 and TEX 02 isn't required until instruction 49.
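Here's a rough Python sketch of how I'm reading the flags in the listing: a TEX with sem_grab closes a group of fetches, and the next ALU instruction carrying sem_wait is where the whole group has to be back. The tables are transcribed from the listing; the function name and the "grab closes the group" interpretation are my assumptions, not something the CTM guide spells out in these terms.

```python
# Flags transcribed from the US Program listing (slot number -> flags).
TEX_FLAGS = {
    0: set(),
    1: set(),
    2: {"sem_wait", "sem_grab"},
    13: {"alu_wait"},
    14: set(),
    15: set(),
    16: {"sem_wait", "sem_grab"},
}
ALU_SEM_WAIT_SLOTS = [7, 23]  # ALU instructions that carry sem_wait


def semaphore_windows():
    """Return (TEX slots in group, slot of ALU sem_wait, ALU instructions in between)."""
    windows = []
    group = []
    for slot in sorted(TEX_FLAGS):
        group.append(slot)
        if "sem_grab" in TEX_FLAGS[slot]:
            # Assumed reading: the grab closes the group, and the next ALU
            # sem_wait is where every fetch in the group must have returned.
            wait = next(s for s in ALU_SEM_WAIT_SLOTS if s > slot)
            # Every slot between the grab and the wait is an ALU instruction
            # in this particular listing, so a simple difference is enough.
            windows.append((group, wait, wait - slot - 1))
            group = []
    return windows


for tex_slots, wait_slot, alu_count in semaphore_windows():
    print(f"TEX slots {tex_slots} -> sem_wait at slot {wait_slot}, "
          f"{alu_count} ALU instructions of cover")
```

Read this way, the first group (slots 0 to 2) gets 4 ALU instructions of cover before slot 7, and the second group (slots 13 to 16) gets 6 before slot 23.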

That means there are only 4 ALU instructions (slots 3 to 6) that can run in parallel with the 3 TEXs. That's a 1.33:1 ALU:TEX ratio. But R580 can run 3 ALUs per clock, so the effective ratio is actually 0.44:1.

R580 can seemingly allocate 21 batches per cluster for this shader (assuming 128 batches * 48 fragments * 2 nominal registers = 12288 = 21.33 * 48 * 12 actual registers). That's 84 ALU clocks of latency hiding per instruction slot (21 batches * 4 clocks per slot), somewhat short of the ~200 cycles that's taken as the "norm" for bilinear latency. This is counteracted by the shader's overall 2.71:1 ratio, implying ~227 ALU cycles of latency hiding per TEX fetch.

But at the start of this shader, those 4 ALU instructions hide only 84 * 0.44 = ~37 clocks of texturing latency per texture fetch.
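For anyone who wants to check the arithmetic, here it is as a short Python sketch. All the inputs are the figures assumed in this post (128 nominal batches, 48 fragments per batch, 2 nominal registers, 4 clocks per slot, 3 ALUs per clock, the 2.71:1 overall ratio); the variable names are just mine.

```python
# Inputs: all taken from the post and the shader stats, none measured.
NOMINAL_BATCHES = 128
FRAGMENTS_PER_BATCH = 48
NOMINAL_REGISTERS = 2
PIX_SIZE = 12            # "Pix Size" in the shader stats = registers per fragment
CLOCKS_PER_SLOT = 4
ALUS_PER_CLOCK = 3
OVERALL_RATIO = 2.71     # the post's figure for the whole shader

# Register file capacity implied by the nominal figures.
register_file = NOMINAL_BATCHES * FRAGMENTS_PER_BATCH * NOMINAL_REGISTERS

# Whole batches that fit when every fragment needs PIX_SIZE registers.
batches = int(register_file / (FRAGMENTS_PER_BATCH * PIX_SIZE))

# ALU clocks of latency hiding bought by one instruction slot.
clocks_per_slot = batches * CLOCKS_PER_SLOT

# Opening group: 4 ALU instructions against 3 TEX fetches.
opening_ratio = 4 / 3
effective_ratio = opening_ratio / ALUS_PER_CLOCK

hidden_at_start = clocks_per_slot * effective_ratio
hidden_overall = clocks_per_slot * OVERALL_RATIO

print(register_file, batches, clocks_per_slot)
print(round(effective_ratio, 2), round(hidden_at_start, 1), round(hidden_overall, 1))
```

The opening group comes out at roughly 37 clocks per fetch against roughly 227 for the shader as a whole, which is the gap I'm pointing at.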

That's quite a bottleneck, and I think it might lie at the heart of R5xx's "large" register file. I suspect G8x's ability to have multiple concurrent TEX semaphores allows it to get away with much less register file.

Jawed
 
So, does that mean G80 isn't as wasteful at hiding texturing latency as was assumed early on?
Any similar data for R600?
 
I've tended to view G80 as "marginal" in terms of texturing latency-hiding, not wasteful. If I'm right about multiple, overlapping semaphores in G8x, then it means you can put it into tighter corners than R5xx and it'll still come out on top.

AMD is keeping R600 analysis tools close to its chest, so no clues there. I doubt R600 will be different in this respect. This single-semaphore model appears to be in Xenos, too, so I'm kinda sceptical that R600 is "better". I think it's fundamental to ATI's threading model.

I think G80 may not actually use explicit semaphores - this function may be nothing more than "register rrr now contains the result of TEX operation nnn", as I think the architecture uses per-register interlocking to prevent a batch being issued to the ALU pipe until all registers are "ready". Should be easy to implement as bit-fields in a scoreboard, I guess. This interlocking also deals with the read after write hazard, I think.
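As a purely hypothetical illustration of what I mean by bit-fields in a scoreboard (this is my guess at the idea, not anything NVIDIA has documented, and the class and method names are made up), here's a minimal Python sketch:

```python
class RegisterScoreboard:
    """Hypothetical per-register interlock: one pending bit per register."""

    def __init__(self):
        self.pending = 0  # bit r set => register r is waiting on a TEX result

    def issue_tex(self, dst):
        """A texture fetch has been issued that will write register dst."""
        self.pending |= 1 << dst

    def tex_returned(self, dst):
        """The texture result for register dst has arrived."""
        self.pending &= ~(1 << dst)

    def can_issue(self, sources, dst=None):
        """An ALU instruction may issue only if none of its registers are pending."""
        mask = 0
        for reg in sources:
            mask |= 1 << reg      # a pending source would be a read-after-write hazard
        if dst is not None:
            mask |= 1 << dst      # also avoid overwriting a register still in flight
        return (self.pending & mask) == 0


# Register numbers borrowed from the R5xx listing purely as an example:
# the three opening fetches write r01, r00 and r07.
sb = RegisterScoreboard()
for dst in (1, 0, 7):
    sb.issue_tex(dst)

print(sb.can_issue([1], dst=8))   # r01 still pending, so this must wait
sb.tex_returned(1)
print(sb.can_issue([1], dst=8))   # r01 is back; r00 and r07 don't block it
print(sb.can_issue([0], dst=1))   # an instruction reading r00 still waits
```

The point of the sketch is the middle case: with per-register tracking, an instruction that only needs r01 can go as soon as r01 returns, rather than waiting for every fetch that was grouped under the same semaphore.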

Jawed
 
I was actually kind of surprised by the texture semaphore stuff when CTM came out. Until then I'd just assumed the decoupled texture lookups used some kind of register scoreboarding. Simple, well-understood, widely-used, and not very expensive. So, even though that's not what ATI does after all, in the absence of information to the contrary I'm going to assume it's what NV does ;).

The R600 diagrams cleared this up a little though. Apparently ATI considers switching between ALU segments and TEX segments a fairly heavy-weight operation. Every texture request/result has to cross the ring-bus... I'd be surprised if every switch didn't cost 10s of cycles (hidden by other work normally, of course). This kind of explains ATI's traditional limits and perf problems with many levels of dependent texturing.

On the other hand, G80's ALU and texture units are much more tightly coupled, so it's probably cheaper for them.

It would be interesting to try to measure how expensive toggling between texture lookup sections and math sections is on the two architectures. Would be kind of hard to isolate though. Hmmm...
 
I was actually kind of surprised by the texture semaphore stuff when CTM came out. Until then I'd just assumed the decoupled texture lookups used some kind of register scoreboarding. Simple, well-understood, widely-used, and not very expensive. So, even though that's not what ATI does after all, in the absence of information to the contrary I'm going to assume it's what NV does ;).
I dare say that's in the region of 99%+ certainty based on my vague memories of patents, forum posts and what Shebanow said - plus time to reflect on it all.

The R600 diagrams cleared this up a little though. Apparently ATI considers switching between ALU segments and TEX segments a fairly heavy-weight operation.
I suspect it also relates to the jump required when dynamic branching doesn't follow the default path. Or, perhaps, the compiler produces separate clauses for each branch/iteration and then the on-chip scheduling issues clauses depending on the results of a BR/CMP - the SEQ has to decide whether to issue the clause with predication or whether the clause can be junked (exited/skipped). I haven't really got a good understanding of this...

I also sorta wonder whether each clause (bounded by tex wait or branch/compare) is in some way considered as a "separate program". I have a feeling from patents that the thread control processors in R600 issue clauses rather than single instructions at a time, letting them run "to completion", without any need to evaluate register state or program state on each clock. The clause boundaries are the only places where evaluation is required.

Every texture request/result has to cross the ring-bus... I'd be surprised if every switch didn't cost 10s of cycles (hidden by other work normally, of course). This kind of explains ATI's traditional limits and perf problems with many levels of dependent texturing.
Before R5xx there wasn't a ring bus - the coupling was much tighter - so I'm not sure what you mean by "traditional limits". I don't know much about dependent texturing performance problems in ATI hardware - though I've assumed for a while now that R3xx...R4xx have a relatively small batch size (nominally 256 fragments, 64 clocks) - but I don't know of any good evidence (apart from screenshots that show 16x16 black squares when a hardware failure occurs). ATI has been recommending coding with 3:1 ALU:TEX since R420 or before, as far as I can tell. All of that would tend to indicate "less" dependent texturing performance, but I'm not familiar with direct evidence.

On the other hand, G80's ALU and texture units are much more tightly coupled, so it's probably cheaper for them.

It would be interesting to try to measure how expensive toggling between texture lookup sections and math sections is on the two architectures. Would be kind of hard to isolate though. Hmmm...
CTM/CUDA should make this easier I imagine.

Jawed
 