GPCBenchmark - An OpenCL General Purpose Computing benchmark

Discussion in 'GPGPU Technology & Programming' started by Arnold Beckenbauer, Apr 30, 2010.

  1. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    Yes, certainly; my point was that these benchmarks are not particularly useful when applied to compare the "superiority" of various architectures... you write the algorithms to the architectural strengths nowadays, and while comparing the same code running on different architectures tells you some basic things, it isn't that practically useful.

    Yeah but it's the perfect problem size if that's what your application needs to solve :) Certainly it doesn't max out the GPU architectures in terms of throughput, but that may be a problem in the future. I've already got dozens of little kernels that don't even get close to filling up a modern GPU that need to be run as part of various algorithms and I foresee this problem worsening.

    You always want to extract as much parallelism as possible, but there's a point at which the long instruction and memory latencies of GPUs start to become a bit excessive. Any real algorithm is a mix of "very wide parallel" (millions), moderate (thousands) and practically scalar bits of code. For instance, modern GPUs aren't a hell of a lot faster at practical scans and reductions than older ones, since they all just get bottlenecked on the lower levels of the tree (where there is less parallelism). Throughput-oriented parts are definitely great, but I'm starting to see the end of the tunnel where I just can't find enough parallel work to saturate the massive requirements of these pipelines :)
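    A minimal OpenCL sketch of the scan/reduction pattern in question (an illustration, not code from this thread): each pass of the tree halves the number of active work items, so the final levels run nearly scalar no matter how wide the GPU is.

    Code:
    // Classic two-level reduction: parallelism shrinks from
    // get_local_size(0) down to 1 within each work group.
    __kernel void reduce_sum(__global const float* in,
                             __global float* out,
                             __local float* scratch)
    {
        uint lid = get_local_id(0);
        scratch[lid] = in[get_global_id(0)];
        barrier(CLK_LOCAL_MEM_FENCE);
    
        for(uint stride = get_local_size(0) / 2; stride > 0; stride >>= 1) {
            if(lid < stride)
                scratch[lid] += scratch[lid + stride];
            barrier(CLK_LOCAL_MEM_FENCE);   // most work items now sit idle
        }
    
        // One partial sum per group; a second, far less parallel pass
        // (or the host) reduces the per-group results.
        if(lid == 0)
            out[get_group_id(0)] = scratch[0];
    }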

    It will definitely be interesting to see how things continue to develop!
     
  2. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    If you are asking me, I don't have enough data to consider it a general or specific disadvantage. There is certainly some evidence pointing towards ATI hardware having some issues, but we don't know if this is because of the effective first-mover advantage that Nvidia has or real specific issues. We'll learn more over time as both sides have to start dealing with legacy code and legacy hardware issues.
     
  3. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    Yep, one could certainly say that maybe one of the drawbacks of GPUs is that you need kernels with fairly large dataset sizes in order to both amortize the data transfer costs and spool the parallel hardware up/down. There are certainly going to be workloads that could use a lot of flops but aren't going to get it out of GPGPU because of these issues.


    That Amdahl was smart, wasn't he ;)

    The future is going to have to be some form of heterogeneous system. I honestly don't ever foresee us getting away from x86 at this point (so much legacy, it keeps building, and there are no viable challengers on the horizon), so that means the future target point is either some graphics core(s) like r870 and its successors or some LRB cores (one would assume), with coherent, shared memory between them and hopefully things like dynamic power sharing/turbo, etc. Then again, apparently Nvidia thinks things need to be even more parallel.
     
  4. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    The annoying thing is that compute kernels don't need rasterising, but it seems that rasterisation is getting in the way of launching threads :sad: That is, each work item in a domain needs to know its "coordinates" (work group and intra work group), but it seems this stuff is bottlenecked by "rasterisation".

    I'm sure Larrabee doesn't have this problem.

    Obviously caches still take a bit of time to warm up, but as the SIMD count keeps increasing it's a nonsense to make them wait in a queue for work, when a data-parallel execution domain is by definition not a queue.

    Jawed
     
  5. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    Not quite sure I understand your comment. What part of rasterization is getting in the way?

    The main issue for a lot of the spool up/down is the time it takes to just get the data from the main memory to the gpu memory and the results back. And for small problem sizes you have to add in the issue of large latencies and a lack of robust caches in GPUs.
     
  6. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    It takes multiple cycles to create each hardware thread. A core can't hide the latencies inherent in a kernel until it has enough hardware threads up and running to hide those latencies.

    Absolutely, taken as read in any such discussion.

    Data can be kept on the graphics card for the duration of a sequence of different kernels - the programmer isn't obliged to shuffle data to/from the GPU for each kernel invocation.
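    A minimal host-side sketch of that pattern in C, assuming standard OpenCL 1.x entry points and two hypothetical kernels, kernelA and kernelB, that read/write the same device buffer (error checking elided):

    Code:
    #include <CL/cl.h>
    
    /* The buffer stays resident on the device across both launches;
     * nothing is read back until the very end. */
    void run_two_kernels(cl_context ctx, cl_command_queue queue,
                         cl_kernel kernelA, cl_kernel kernelB,
                         const float* host_in, float* host_out, size_t n)
    {
        cl_int err;
        size_t bytes = n * sizeof(float);
        size_t gws = n;
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, &err);
    
        clEnqueueWriteBuffer(queue, buf, CL_FALSE, 0, bytes, host_in,
                             0, NULL, NULL);
    
        clSetKernelArg(kernelA, 0, sizeof(cl_mem), &buf);
        clEnqueueNDRangeKernel(queue, kernelA, 1, NULL, &gws, NULL,
                               0, NULL, NULL);
    
        clSetKernelArg(kernelB, 0, sizeof(cl_mem), &buf);  /* reuses device data */
        clEnqueueNDRangeKernel(queue, kernelB, 1, NULL, &gws, NULL,
                               0, NULL, NULL);
    
        clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, bytes, host_out,
                            0, NULL, NULL);
        clReleaseMemObject(buf);
    }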

    Jawed
     
  7. Florin

    Florin Merrily dodgy
    Veteran Subscriber

    Joined:
    Aug 27, 2003
    Messages:
    1,650
    Likes Received:
    220
    Location:
    The colonies
    New round of results for Stream 2.1 (stock 5850 + stock i7-860).

    The benchmark no longer crashes when performing the 'Image-Access' tests on Cypress but quits with a message box saying 'Error #-11 when running OpenCL'.

    The 'Special Functions' test still says 'Warning: Current platform/driver doesn't support double-precision built-in special functions.'

    Driver Catalyst 10.4 on Windows 7 64-bit.

    Device Info page

    Global Memory
    Local Memory
    Int32 Ops
    Float Ops
    Double Ops
    Common Math
    Image Processing
    Cryption

    Some random observations:
    Generally scores haven't changed much but seem to be trending a bit slower (may be circumstantial).
    CPU driver reports a new cl_amd_fp64 extension.
    12.5% improvement on Int32 Ops Add score on Cypress.
    'Native Special Functions' on Float Ops improved (~8-10%) on both CPU and GPU OpenCL.
    Noticeable drop in Cypress Double-Precision performance during the 'Parallel Reduction' test on Common Math.
     
  8. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,785
    Likes Received:
    173
    Location:
    Taiwan
    I found the code:

    Code:
    // NLM shader
    // Copyright(c) 2009 Ping-Che Chen 
    
    #define num_threads_x 16
    #define num_threads_y 16
    #define kernel_half 3
    
    
    Texture2D Input : register(t0);
    RWByteAddressBuffer Result : register(u0);
    
    
    cbuffer cb0
    {
    	float g_sigma;
    	int2 g_imagesize;
    }
    
    
    static const float gaussian[7] = {
    	0.1062888f, 0.1403214f, 0.1657702f, 0.1752401f, 0.1657702f, 0.1403214f, 0.1062888f
    };
    
    
    groupshared float shared_r[(num_threads_y + kernel_half * 4) * 48];
    groupshared float shared_g[(num_threads_y + kernel_half * 4) * 48];
    groupshared float shared_b[(num_threads_y + kernel_half * 4) * 48];
    groupshared float dots[(num_threads_y + kernel_half * 2) * 48];
    groupshared float dots2[(num_threads_y + kernel_half * 2) * 16];
    
    
    [numthreads(num_threads_x, num_threads_y, 1)]
    void CSMain(uint3 Gid : SV_DispatchThreadID, uint3 Tid : SV_GroupThreadID)
    {
    	int i, j, l;
    
    	float3 total_color = float3(0.0f, 0.0f, 0.0f);
    	float total_weight = 0.0f;
    	float3 c;
    
    	c = Input.Load(clamp(int3(Gid.x - kernel_half * 2, Gid.y - kernel_half * 2, 0), int3(0, 0, 0), int3(Gid.x, Gid.y, 0)));
    	shared_r[Tid.y * 48 + Tid.x] = c.r;
    	shared_g[Tid.y * 48 + Tid.x] = c.g;
    	shared_b[Tid.y * 48 + Tid.x] = c.b;
    	
    	if(Tid.x < kernel_half * 4) {
    		c = Input.Load(clamp(int3(Gid.x + num_threads_x - kernel_half * 2, Gid.y - kernel_half * 2, 0), int3(0, 0, 0), int3(Gid.x, Gid.y, 0)));
    		shared_r[Tid.y * 48 + Tid.x + num_threads_x] = c.r;
    		shared_g[Tid.y * 48 + Tid.x + num_threads_x] = c.g;
    		shared_b[Tid.y * 48 + Tid.x + num_threads_x] = c.b;
    	}
    	
    	if(Tid.y < kernel_half * 4) {
    		c = Input.Load(clamp(int3(Gid.x - kernel_half * 2, Gid.y + num_threads_y - kernel_half * 2, 0), int3(0, 0, 0), int3(Gid.x, Gid.y, 0)));
    		shared_r[(Tid.y + num_threads_y) * 48 + Tid.x] = c.r;
    		shared_g[(Tid.y + num_threads_y) * 48 + Tid.x] = c.g;
    		shared_b[(Tid.y + num_threads_y) * 48 + Tid.x] = c.b;
    		
    		if(Tid.x < kernel_half * 4) {
    			c = Input.Load(clamp(int3(Gid.x + num_threads_x - kernel_half * 2, Gid.y + num_threads_y - kernel_half * 2, 0), int3(0, 0, 0), int3(Gid.x, Gid.y, 0)));
    			shared_r[(Tid.y + num_threads_y) * 48 + Tid.x + num_threads_x] = c.r;
    			shared_g[(Tid.y + num_threads_y) * 48 + Tid.x + num_threads_x] = c.g;
    			shared_b[(Tid.y + num_threads_y) * 48 + Tid.x + num_threads_x] = c.b;
    		}
    	}
    	
    	GroupMemoryBarrierWithGroupSync();
    
    	for(i = -kernel_half; i <= kernel_half; i++) {
    		for(j = -kernel_half; j <= kernel_half; j++) {
    			float total;
    			float3 c1, c2, cd, cp;
    
    			int y1 = (i + Tid.y + kernel_half) * 48 + j + Tid.x + kernel_half;
    			int y2 = (Tid.y + kernel_half) * 48 + Tid.x + kernel_half;			
    			
    			c1 = float3(shared_r[y1], shared_g[y1], shared_b[y1]);
    			c2 = float3(shared_r[y2], shared_g[y2], shared_b[y2]);
    			cd = (c2 - c1);
    			dots[Tid.y * 48 + Tid.x] = dot(cd, cd);
    			
    			if(Tid.x < kernel_half * 2) {
    				c1 = float3(shared_r[y1 + num_threads_x], shared_g[y1 + num_threads_x], shared_b[y1 + num_threads_x]);
    				c2 = float3(shared_r[y2 + num_threads_x], shared_g[y2 + num_threads_x], shared_b[y2 + num_threads_x]);
    				cd = (c2 - c1);
    				dots[Tid.y * 48 + Tid.x + num_threads_x] = dot(cd, cd);
    			}
    			
    			if(Tid.y < kernel_half * 2) {
    				c1 = float3(shared_r[y1 + num_threads_y * 48], shared_g[y1 + num_threads_y * 48], shared_b[y1 + num_threads_y * 48]);
    				c2 = float3(shared_r[y2 + num_threads_y * 48], shared_g[y2 + num_threads_y * 48], shared_b[y2 + num_threads_y * 48]);
    				cd = (c2 - c1);
    				dots[(Tid.y + num_threads_y) * 48 + Tid.x] = dot(cd, cd);
    				
    				if(Tid.x < kernel_half * 2) {
    					c1 = float3(shared_r[y1 + num_threads_y * 48 + num_threads_x], shared_g[y1 + num_threads_y * 48 + num_threads_x], shared_b[y1 + num_threads_y * 48 + num_threads_x]);
    					c2 = float3(shared_r[y2 + num_threads_y * 48 + num_threads_x], shared_g[y2 + num_threads_y * 48 + num_threads_x], shared_b[y2 + num_threads_y * 48 + num_threads_x]);
    					cd = (c2 - c1);
    					dots[(Tid.y + num_threads_y) * 48 + Tid.x + num_threads_x] = dot(cd, cd);
    				}
    			}
    			
    			GroupMemoryBarrierWithGroupSync();
    					
    			float d = 0.0f;		
    			[unroll]
    			for(l = 0; l <= kernel_half * 2; l++) {
    				d += dots[Tid.y * 48 + Tid.x + l] * gaussian[l];
    			}
    			
    			dots2[Tid.y * num_threads_x + Tid.x] = d;
    			
    			if(Tid.y < kernel_half * 2) {
    				d = 0.0f;
    				[unroll]
    				for(l = 0; l <= kernel_half * 2; l++) {
    					d += dots[(Tid.y + num_threads_y) * 48 + Tid.x + l] * gaussian[l];
    				}
    				
    				dots2[(Tid.y + num_threads_y) * num_threads_x + Tid.x] = d;
    			}
    			
    			GroupMemoryBarrierWithGroupSync();
    
    			d = 0.0f;
    			
    			[unroll]
    			for(l = 0; l <= kernel_half * 2; l++) {
    				d += dots2[(Tid.y + l) * num_threads_x + Tid.x] * gaussian[l];
    			}
    
    			total = exp(-d * g_sigma);
    			
    			cp = float3(shared_r[(i + Tid.y + kernel_half * 2) * 48 + j + Tid.x + kernel_half * 2],
    				shared_g[(i + Tid.y + kernel_half * 2) * 48 + j + Tid.x + kernel_half * 2],
    				shared_b[(i + Tid.y + kernel_half * 2) * 48 + j + Tid.x + kernel_half * 2]);
    				
    			total_weight += total;
    			total_color += total * cp;
    		}
    	}
    
    	float3 vColor = total_color / total_weight;
    
    	if(Gid.x < g_imagesize.x && Gid.y < g_imagesize.y) {
    		float3 vc = vColor * 255 + 0.5;
    		vc = clamp(vc, 0, 255);
    		uint x = vc.x;
    		uint y = vc.y;
    		uint z = vc.z;
    		uint c = z + (y << 8) + (x << 16);
    		Result.Store((Gid.y * g_imagesize.x + Gid.x) * 4, c);
    	}
    }
    It takes about 3 minutes to compile on my computer with a Radeon 5850. If I add [loop] to the two main loops, it fails to compile, with error messages like:

    "synchronization operations cannot be used in varying flow control"
    and
    "Can't unroll loops marked with loop attribute"
     
  9. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    I just put that code into GPUSA version 1.53.2258. It took about 30 seconds to compile. I made sure to have just 1 GPU selected for compilation, HD5870.

    It produces one hell of a tortuous ISA! 60 GPRs and 130 scratch registers :shock:

    EDIT: whoops, not paying attention. With [loop] on the first loop it works, compiling even more quickly, about 20 seconds. With both loops marked, I get the failure you are reporting after less than 5 seconds. If only the second loop is marked, it takes about 30 seconds to fail with the same messages you're getting.

    This is the D3D assembly produced with the first loop marked by [loop]:
    Code:
    //
    // Generated by Microsoft (R) HLSL Shader Compiler 9.27.952.3022
    //
    //
    // Buffer Definitions: 
    //
    // cbuffer cb0
    // {
    //
    //   float g_sigma;                     // Offset:    0 Size:     4
    //   int2 g_imagesize;                  // Offset:    4 Size:     8
    //
    // }
    //
    //
    // Resource Bindings:
    //
    // Name                                 Type  Format         Dim Slot Elements
    // ------------------------------ ---------- ------- ----------- ---- --------
    // Input                             texture  float4          2d    0        1
    // Result                                UAV    byte         r/w    0        1
    // cb0                               cbuffer      NA          NA    0        1
    //
    //
    //
    // Input signature:
    //
    // Name                 Index   Mask Register SysValue Format   Used
    // -------------------- ----- ------ -------- -------- ------ ------
    // no Input
    //
    // Output signature:
    //
    // Name                 Index   Mask Register SysValue Format   Used
    // -------------------- ----- ------ -------- -------- ------ ------
    // no Output
    cs_5_0
    dcl_globalFlags refactoringAllowed
    dcl_constantbuffer cb0[1], immediateIndexed
    dcl_resource_texture2d (float,float,float,float) t0
    dcl_uav_raw u0
    dcl_input vThreadIDInGroup.xy
    dcl_input vThreadID.xy
    dcl_temps 15
    dcl_tgsm_structured g0, 4, 1344
    dcl_tgsm_structured g1, 4, 1344
    dcl_tgsm_structured g2, 4, 1344
    dcl_tgsm_structured g3, 4, 1056
    dcl_tgsm_structured g4, 4, 352
    dcl_thread_group 16, 16, 1
    iadd r0.zw, vThreadID.xxxy, l(0, 0, -6, -6)
    imax r1.xy, r0.zwzz, l(0, 0, 0, 0)
    imin r1.xy, r1.xyxx, vThreadID.xyxx
    mov r1.zw, l(0,0,0,0)
    ld_indexable(texture2d)(float,float,float,float) r1.xyz, r1.xyzw, t0.xyzw
    imad r1.w, vThreadIDInGroup.y, l(48), vThreadIDInGroup.x
    store_structured g0.x, r1.w, l(0), r1.x
    store_structured g1.x, r1.w, l(0), r1.y
    store_structured g2.x, r1.w, l(0), r1.z
    ult r2.xyzw, vThreadIDInGroup.xyxy, l(12, 12, 6, 6)
    if_nz r2.x
      iadd r0.x, vThreadID.x, l(10)
      imax r1.xy, r0.xwxx, l(0, 0, 0, 0)
      imin r3.xy, r1.xyxx, vThreadID.xyxx
      mov r3.zw, l(0,0,0,0)
      ld_indexable(texture2d)(float,float,float,float) r1.xyz, r3.xyzw, t0.xyzw
      iadd r0.w, r1.w, l(16)
      store_structured g0.x, r0.w, l(0), r1.x
      store_structured g1.x, r0.w, l(0), r1.y
      store_structured g2.x, r0.w, l(0), r1.z
    endif 
    if_nz r2.y
      iadd r0.y, vThreadID.y, l(10)
      imax r0.zw, r0.zzzy, l(0, 0, 0, 0)
      imin r3.xy, r0.zwzz, vThreadID.xyxx
      mov r3.zw, l(0,0,0,0)
      ld_indexable(texture2d)(float,float,float,float) r1.xyz, r3.xyzw, t0.xyzw
      iadd r0.z, vThreadIDInGroup.y, l(16)
      imad r0.z, r0.z, l(48), vThreadIDInGroup.x
      store_structured g0.x, r0.z, l(0), r1.x
      store_structured g1.x, r0.z, l(0), r1.y
      store_structured g2.x, r0.z, l(0), r1.z
      if_nz r2.x
        iadd r0.x, vThreadID.x, l(10)
        imax r0.xy, r0.xyxx, l(0, 0, 0, 0)
        imin r3.xy, r0.xyxx, vThreadID.xyxx
        mov r3.zw, l(0,0,0,0)
        ld_indexable(texture2d)(float,float,float,float) r1.xyz, r3.xyzw, t0.xyzw
        iadd r0.x, r0.z, l(16)
        store_structured g0.x, r0.x, l(0), r1.x
        store_structured g1.x, r0.x, l(0), r1.y
        store_structured g2.x, r0.x, l(0), r1.z
      endif 
    endif 
    sync_g_t
    iadd r0.xy, vThreadIDInGroup.yyyy, l(16, 3, 0, 0)
    imad r0.zw, r0.yyyx, l(0, 0, 48, 48), vThreadIDInGroup.xxxx
    iadd r3.xyzw, r0.zzzz, l(3, 19, 771, 787)
    iadd r4.xyzw, r1.wwww, l(16, 1, 2, 3)
    iadd r1.xyz, r1.wwww, l(4, 5, 6, 0)
    ishl r0.z, vThreadIDInGroup.y, l(4)
    iadd r5.xyzw, r0.wwww, l(16, 1, 2, 3)
    iadd r6.xyz, r0.wwww, l(4, 5, 6, 0)
    ishl r0.xy, r0.xyxx, l(4, 4, 0, 0)
    iadd r0.xyz, r0.xyzx, vThreadIDInGroup.xxxx
    iadd r7.xyzw, r0.zzzz, l(16, 32, 64, 80)
    iadd r2.x, r0.z, l(96)
    mov r8.xyz, l(0,0,0,0)
    mov r2.y, l(-3)
    mov r6.w, l(0)
    loop 
      ilt r8.w, l(3), r2.y
      breakc_nz r8.w
      iadd r8.w, r2.y, vThreadIDInGroup.y
      iadd r9.xy, r8.wwww, l(3, 6, 0, 0)
      imul null, r9.zw, r9.xxxy, l(0, 0, 48, 48)
      imad r10.xyz, r9.yxyy, l(48, 48, 48, 0), vThreadIDInGroup.xxxx
      ld_structured r11.x, r10.y, l(0), g0.xxxx
      ld_structured r11.y, r10.y, l(0), g1.xxxx
      ld_structured r11.z, r10.y, l(0), g2.xxxx
      ld_structured r12.x, r3.x, l(0), g0.xxxx
      ld_structured r12.y, r3.x, l(0), g1.xxxx
      ld_structured r12.z, r3.x, l(0), g2.xxxx
      add r11.xyz, -r11.xyzx, r12.xyzx
      dp3 r8.w, r11.xyzx, r11.xyzx
      store_structured g3.x, r1.w, l(0), r8.w
      if_nz r2.z
        iadd r8.w, r10.y, l(16)
        ld_structured r11.x, r8.w, l(0), g0.xxxx
        ld_structured r11.y, r8.w, l(0), g1.xxxx
        ld_structured r11.z, r8.w, l(0), g2.xxxx
        ld_structured r12.x, r3.y, l(0), g0.xxxx
        ld_structured r12.y, r3.y, l(0), g1.xxxx
        ld_structured r12.z, r3.y, l(0), g2.xxxx
        add r11.xyz, -r11.xyzx, r12.xyzx
        dp3 r8.w, r11.xyzx, r11.xyzx
        store_structured g3.x, r4.x, l(0), r8.w
      endif 
      if_nz r2.w
        iadd r8.w, r10.y, l(768)
        ld_structured r11.x, r8.w, l(0), g0.xxxx
        ld_structured r11.y, r8.w, l(0), g1.xxxx
        ld_structured r11.z, r8.w, l(0), g2.xxxx
        ld_structured r12.x, r3.z, l(0), g0.xxxx
        ld_structured r12.y, r3.z, l(0), g1.xxxx
        ld_structured r12.z, r3.z, l(0), g2.xxxx
        add r11.xyz, -r11.xyzx, r12.xyzx
        dp3 r8.w, r11.xyzx, r11.xyzx
        store_structured g3.x, r0.w, l(0), r8.w
        if_nz r2.z
          iadd r8.w, r10.y, l(784)
          ld_structured r11.x, r8.w, l(0), g0.xxxx
          ld_structured r11.y, r8.w, l(0), g1.xxxx
          ld_structured r11.z, r8.w, l(0), g2.xxxx
          ld_structured r12.x, r3.w, l(0), g0.xxxx
          ld_structured r12.y, r3.w, l(0), g1.xxxx
          ld_structured r12.z, r3.w, l(0), g2.xxxx
          add r11.xyz, -r11.xyzx, r12.xyzx
          dp3 r8.w, r11.xyzx, r11.xyzx
          store_structured g3.x, r5.x, l(0), r8.w
        endif 
      endif 
      sync_g_t
      ld_structured r8.w, r1.w, l(0), g3.xxxx
      ld_structured r9.x, r4.y, l(0), g3.xxxx
      mul r9.x, r9.x, l(0.140321)
      mad r8.w, r8.w, l(0.106289), r9.x
      ld_structured r9.x, r4.z, l(0), g3.xxxx
      mad r8.w, r9.x, l(0.165770), r8.w
      ld_structured r9.x, r4.w, l(0), g3.xxxx
      mad r8.w, r9.x, l(0.175240), r8.w
      ld_structured r9.x, r1.x, l(0), g3.xxxx
      mad r8.w, r9.x, l(0.165770), r8.w
      ld_structured r9.x, r1.y, l(0), g3.xxxx
      mad r8.w, r9.x, l(0.140321), r8.w
      ld_structured r9.x, r1.z, l(0), g3.xxxx
      mad r8.w, r9.x, l(0.106289), r8.w
      store_structured g4.x, r0.z, l(0), r8.w
      if_nz r2.w
        ld_structured r8.w, r0.w, l(0), g3.xxxx
        ld_structured r9.x, r5.y, l(0), g3.xxxx
        mul r9.x, r9.x, l(0.140321)
        mad r8.w, r8.w, l(0.106289), r9.x
        ld_structured r9.x, r5.z, l(0), g3.xxxx
        mad r8.w, r9.x, l(0.165770), r8.w
        ld_structured r9.x, r5.w, l(0), g3.xxxx
        mad r8.w, r9.x, l(0.175240), r8.w
        ld_structured r9.x, r6.x, l(0), g3.xxxx
        mad r8.w, r9.x, l(0.165770), r8.w
        ld_structured r9.x, r6.y, l(0), g3.xxxx
        mad r8.w, r9.x, l(0.140321), r8.w
        ld_structured r9.x, r6.z, l(0), g3.xxxx
        mad r8.w, r9.x, l(0.106289), r8.w
        store_structured g4.x, r0.x, l(0), r8.w
      endif 
      sync_g_t
      ld_structured r8.w, r0.z, l(0), g4.xxxx
      ld_structured r9.x, r7.x, l(0), g4.xxxx
      mul r9.x, r9.x, l(0.140321)
      mad r8.w, r8.w, l(0.106289), r9.x
      ld_structured r9.x, r7.y, l(0), g4.xxxx
      mad r8.w, r9.x, l(0.165770), r8.w
      ld_structured r9.x, r0.y, l(0), g4.xxxx
      mad r8.w, r9.x, l(0.175240), r8.w
      ld_structured r9.x, r7.z, l(0), g4.xxxx
      mad r8.w, r9.x, l(0.165770), r8.w
      ld_structured r9.x, r7.w, l(0), g4.xxxx
      mad r8.w, r9.x, l(0.140321), r8.w
      ld_structured r9.x, r2.x, l(0), g4.xxxx
      mad r8.w, r9.x, l(0.106289), r8.w
      mul r8.w, -r8.w, cb0[0].x
      mul r8.w, r8.w, l(1.442695)
      exp r8.w, r8.w
      iadd r11.xyzw, r10.zyzy, l(3, 1, 4, 2)
      ld_structured r12.x, r11.x, l(0), g0.xxxx
      ld_structured r12.y, r11.x, l(0), g1.xxxx
      ld_structured r12.z, r11.x, l(0), g2.xxxx
      add r9.x, r6.w, r8.w
      mad r12.xyz, r8.wwww, r12.xyzx, r8.xyzx
      ld_structured r13.x, r11.y, l(0), g0.xxxx
      ld_structured r13.y, r11.y, l(0), g1.xxxx
      ld_structured r13.z, r11.y, l(0), g2.xxxx
      ld_structured r14.x, r3.x, l(0), g0.xxxx
      ld_structured r14.y, r3.x, l(0), g1.xxxx
      ld_structured r14.z, r3.x, l(0), g2.xxxx
      add r13.xyz, -r13.xyzx, r14.xyzx
      dp3 r8.w, r13.xyzx, r13.xyzx
      store_structured g3.x, r1.w, l(0), r8.w
      if_nz r2.z
        iadd r8.w, r10.y, l(17)
        ld_structured r13.x, r8.w, l(0), g0.xxxx
        ld_structured r13.y, r8.w, l(0), g1.xxxx
        ld_structured r13.z, r8.w, l(0), g2.xxxx
        ld_structured r14.x, r3.y, l(0), g0.xxxx
        ld_structured r14.y, r3.y, l(0), g1.xxxx
        ld_structured r14.z, r3.y, l(0), g2.xxxx
        add r13.xyz, -r13.xyzx, r14.xyzx
        dp3 r8.w, r13.xyzx, r13.xyzx
        store_structured g3.x, r4.x, l(0), r8.w
      endif 
      if_nz r2.w
        iadd r8.w, r10.y, l(769)
        ld_structured r13.x, r8.w, l(0), g0.xxxx
        ld_structured r13.y, r8.w, l(0), g1.xxxx
        ld_structured r13.z, r8.w, l(0), g2.xxxx
        ld_structured r14.x, r3.z, l(0), g0.xxxx
        ld_structured r14.y, r3.z, l(0), g1.xxxx
        ld_structured r14.z, r3.z, l(0), g2.xxxx
        add r13.xyz, -r13.xyzx, r14.xyzx
        dp3 r8.w, r13.xyzx, r13.xyzx
        store_structured g3.x, r0.w, l(0), r8.w
        if_nz r2.z
          iadd r8.w, r10.y, l(785)
          ld_structured r13.x, r8.w, l(0), g0.xxxx
          ld_structured r13.y, r8.w, l(0), g1.xxxx
          ld_structured r13.z, r8.w, l(0), g2.xxxx
          ld_structured r14.x, r3.w, l(0), g0.xxxx
          ld_structured r14.y, r3.w, l(0), g1.xxxx
          ld_structured r14.z, r3.w, l(0), g2.xxxx
          add r13.xyz, -r13.xyzx, r14.xyzx
          dp3 r8.w, r13.xyzx, r13.xyzx
          store_structured g3.x, r5.x, l(0), r8.w
        endif 
      endif 
      sync_g_t
      ld_structured r8.w, r1.w, l(0), g3.xxxx
      ld_structured r9.y, r4.y, l(0), g3.xxxx
      mul r9.y, r9.y, l(0.140321)
      mad r8.w, r8.w, l(0.106289), r9.y
      ld_structured r9.y, r4.z, l(0), g3.xxxx
      mad r8.w, r9.y, l(0.165770), r8.w
      ld_structured r9.y, r4.w, l(0), g3.xxxx
      mad r8.w, r9.y, l(0.175240), r8.w
      ld_structured r9.y, r1.x, l(0), g3.xxxx
      mad r8.w, r9.y, l(0.165770), r8.w
      ld_structured r9.y, r1.y, l(0), g3.xxxx
      mad r8.w, r9.y, l(0.140321), r8.w
      ld_structured r9.y, r1.z, l(0), g3.xxxx
      mad r8.w, r9.y, l(0.106289), r8.w
      store_structured g4.x, r0.z, l(0), r8.w
      if_nz r2.w
        ld_structured r8.w, r0.w, l(0), g3.xxxx
        ld_structured r9.y, r5.y, l(0), g3.xxxx
        mul r9.y, r9.y, l(0.140321)
        mad r8.w, r8.w, l(0.106289), r9.y
        ld_structured r9.y, r5.z, l(0), g3.xxxx
        mad r8.w, r9.y, l(0.165770), r8.w
        ld_structured r9.y, r5.w, l(0), g3.xxxx
        mad r8.w, r9.y, l(0.175240), r8.w
        ld_structured r9.y, r6.x, l(0), g3.xxxx
        mad r8.w, r9.y, l(0.165770), r8.w
        ld_structured r9.y, r6.y, l(0), g3.xxxx
        mad r8.w, r9.y, l(0.140321), r8.w
        ld_structured r9.y, r6.z, l(0), g3.xxxx
        mad r8.w, r9.y, l(0.106289), r8.w
        store_structured g4.x, r0.x, l(0), r8.w
      endif 
      sync_g_t
      ld_structured r8.w, r0.z, l(0), g4.xxxx
      ld_structured r9.y, r7.x, l(0), g4.xxxx
      mul r9.y, r9.y, l(0.140321)
      mad r8.w, r8.w, l(0.106289), r9.y
      ld_structured r9.y, r7.y, l(0), g4.xxxx
      mad r8.w, r9.y, l(0.165770), r8.w
      ld_structured r9.y, r0.y, l(0), g4.xxxx
      mad r8.w, r9.y, l(0.175240), r8.w
      ld_structured r9.y, r7.z, l(0), g4.xxxx
      mad r8.w, r9.y, l(0.165770), r8.w
      ld_structured r9.y, r7.w, l(0), g4.xxxx
      mad r8.w, r9.y, l(0.140321), r8.w
      ld_structured r9.y, r2.x, l(0), g4.xxxx
      mad r8.w, r9.y, l(0.106289), r8.w
      mul r8.w, -r8.w, cb0[0].x
      mul r8.w, r8.w, l(1.442695)
      exp r8.w, r8.w
      ld_structured r13.x, r11.z, l(0), g0.xxxx
      ld_structured r13.y, r11.z, l(0), g1.xxxx
      ld_structured r13.z, r11.z, l(0), g2.xxxx
      add r9.x, r8.w, r9.x
      mad r11.xyz, r8.wwww, r13.xyzx, r12.xyzx
      ld_structured r12.x, r11.w, l(0), g0.xxxx
      ld_structured r12.y, r11.w, l(0), g1.xxxx
      ld_structured r12.z, r11.w, l(0), g2.xxxx
      ld_structured r13.x, r3.x, l(0), g0.xxxx
      ld_structured r13.y, r3.x, l(0), g1.xxxx
      ld_structured r13.z, r3.x, l(0), g2.xxxx
      add r12.xyz, -r12.xyzx, r13.xyzx
      dp3 r8.w, r12.xyzx, r12.xyzx
      store_structured g3.x, r1.w, l(0), r8.w
      if_nz r2.z
        iadd r8.w, r10.y, l(18)
        ld_structured r12.x, r8.w, l(0), g0.xxxx
        ld_structured r12.y, r8.w, l(0), g1.xxxx
        ld_structured r12.z, r8.w, l(0), g2.xxxx
        ld_structured r13.x, r3.y, l(0), g0.xxxx
        ld_structured r13.y, r3.y, l(0), g1.xxxx
        ld_structured r13.z, r3.y, l(0), g2.xxxx
        add r12.xyz, -r12.xyzx, r13.xyzx
        dp3 r8.w, r12.xyzx, r12.xyzx
        store_structured g3.x, r4.x, l(0), r8.w
      endif 
      if_nz r2.w
        iadd r8.w, r10.y, l(770)
        ld_structured r12.x, r8.w, l(0), g0.xxxx
        ld_structured r12.y, r8.w, l(0), g1.xxxx
        ld_structured r12.z, r8.w, l(0), g2.xxxx
        ld_structured r13.x, r3.z, l(0), g0.xxxx
        ld_structured r13.y, r3.z, l(0), g1.xxxx
        ld_structured r13.z, r3.z, l(0), g2.xxxx
        add r12.xyz, -r12.xyzx, r13.xyzx
        dp3 r8.w, r12.xyzx, r12.xyzx
        store_structured g3.x, r0.w, l(0), r8.w
        if_nz r2.z
          iadd r8.w, r10.y, l(786)
          ld_structured r12.x, r8.w, l(0), g0.xxxx
          ld_structured r12.y, r8.w, l(0), g1.xxxx
          ld_structured r12.z, r8.w, l(0), g2.xxxx
          ld_structured r13.x, r3.w, l(0), g0.xxxx
          ld_structured r13.y, r3.w, l(0), g1.xxxx
          ld_structured r13.z, r3.w, l(0), g2.xxxx
          add r12.xyz, -r12.xyzx, r13.xyzx
          dp3 r8.w, r12.xyzx, r12.xyzx
          store_structured g3.x, r5.x, l(0), r8.w
        endif 
      endif 
      sync_g_t
      ld_structured r8.w, r1.w, l(0), g3.xxxx
      ld_structured r9.y, r4.y, l(0), g3.xxxx
      mul r9.y, r9.y, l(0.140321)
      mad r8.w, r8.w, l(0.106289), r9.y
      ld_structured r9.y, r4.z, l(0), g3.xxxx
      mad r8.w, r9.y, l(0.165770), r8.w
      ld_structured r9.y, r4.w, l(0), g3.xxxx
      mad r8.w, r9.y, l(0.175240), r8.w
      ld_structured r9.y, r1.x, l(0), g3.xxxx
      mad r8.w, r9.y, l(0.165770), r8.w
      ld_structured r9.y, r1.y, l(0), g3.xxxx
      mad r8.w, r9.y, l(0.140321), r8.w
      ld_structured r9.y, r1.z, l(0), g3.xxxx
      mad r8.w, r9.y, l(0.106289), r8.w
      store_structured g4.x, r0.z, l(0), r8.w
      if_nz r2.w
        ld_structured r8.w, r0.w, l(0), g3.xxxx
        ld_structured r9.y, r5.y, l(0), g3.xxxx
        mul r9.y, r9.y, l(0.140321)
        mad r8.w, r8.w, l(0.106289), r9.y
        ld_structured r9.y, r5.z, l(0), g3.xxxx
        mad r8.w, r9.y, l(0.165770), r8.w
        ld_structured r9.y, r5.w, l(0), g3.xxxx
        mad r8.w, r9.y, l(0.175240), r8.w
        ld_structured r9.y, r6.x, l(0), g3.xxxx
        mad r8.w, r9.y, l(0.165770), r8.w
        ld_structured r9.y, r6.y, l(0), g3.xxxx
        mad r8.w, r9.y, l(0.140321), r8.w
        ld_structured r9.y, r6.z, l(0), g3.xxxx
        mad r8.w, r9.y, l(0.106289), r8.w
        store_structured g4.x, r0.x, l(0), r8.w
      endif 
      sync_g_t
      ld_structured r8.w, r0.z, l(0), g4.xxxx
      ld_structured r9.y, r7.x, l(0), g4.xxxx
      mul r9.y, r9.y, l(0.140321)
      mad r8.w, r8.w, l(0.106289), r9.y
      ld_structured r9.y, r7.y, l(0), g4.xxxx
      mad r8.w, r9.y, l(0.165770), r8.w
      ld_structured r9.y, r0.y, l(0), g4.xxxx
      mad r8.w, r9.y, l(0.175240), r8.w
      ld_structured r9.y, r7.z, l(0), g4.xxxx
      mad r8.w, r9.y, l(0.165770), r8.w
      ld_structured r9.y, r7.w, l(0), g4.xxxx
      mad r8.w, r9.y, l(0.140321), r8.w
      ld_structured r9.y, r2.x, l(0), g4.xxxx
      mad r8.w, r9.y, l(0.106289), r8.w
      mul r8.w, -r8.w, cb0[0].x
      mul r8.w, r8.w, l(1.442695)
      exp r8.w, r8.w
      iadd r10.xzw, r10.xxyz, l(5, 0, 3, 6)
      ld_structured r12.x, r10.x, l(0), g0.xxxx
      ld_structured r12.y, r10.x, l(0), g1.xxxx
      ld_structured r12.z, r10.x, l(0), g2.xxxx
      add r9.x, r8.w, r9.x
      mad r11.xyz, r8.wwww, r12.xyzx, r11.xyzx
      ld_structured r12.x, r10.z, l(0), g0.xxxx
      ld_structured r12.y, r10.z, l(0), g1.xxxx
      ld_structured r12.z, r10.z, l(0), g2.xxxx
      ld_structured r13.x, r3.x, l(0), g0.xxxx
      ld_structured r13.y, r3.x, l(0), g1.xxxx
      ld_structured r13.z, r3.x, l(0), g2.xxxx
      add r12.xyz, -r12.xyzx, r13.xyzx
      dp3 r8.w, r12.xyzx, r12.xyzx
      store_structured g3.x, r1.w, l(0), r8.w
      if_nz r2.z
        iadd r8.w, r10.y, l(19)
        ld_structured r12.x, r8.w, l(0), g0.xxxx
        ld_structured r12.y, r8.w, l(0), g1.xxxx
        ld_structured r12.z, r8.w, l(0), g2.xxxx
        ld_structured r13.x, r3.y, l(0), g0.xxxx
        ld_structured r13.y, r3.y, l(0), g1.xxxx
        ld_structured r13.z, r3.y, l(0), g2.xxxx
        add r12.xyz, -r12.xyzx, r13.xyzx
        dp3 r8.w, r12.xyzx, r12.xyzx
        store_structured g3.x, r4.x, l(0), r8.w
      endif 
      if_nz r2.w
        iadd r8.w, r10.y, l(771)
        ld_structured r12.x, r8.w, l(0), g0.xxxx
        ld_structured r12.y, r8.w, l(0), g1.xxxx
        ld_structured r12.z, r8.w, l(0), g2.xxxx
        ld_structured r13.x, r3.z, l(0), g0.xxxx
        ld_structured r13.y, r3.z, l(0), g1.xxxx
        ld_structured r13.z, r3.z, l(0), g2.xxxx
        add r12.xyz, -r12.xyzx, r13.xyzx
        dp3 r8.w, r12.xyzx, r12.xyzx
        store_structured g3.x, r0.w, l(0), r8.w
        if_nz r2.z
          iadd r8.w, r10.y, l(787)
          ld_structured r10.x, r8.w, l(0), g0.xxxx
          ld_structured r10.y, r8.w, l(0), g1.xxxx
          ld_structured r10.z, r8.w, l(0), g2.xxxx
          ld_structured r12.x, r3.w, l(0), g0.xxxx
          ld_structured r12.y, r3.w, l(0), g1.xxxx
          ld_structured r12.z, r3.w, l(0), g2.xxxx
          add r10.xyz, -r10.xyzx, r12.xyzx
          dp3 r8.w, r10.xyzx, r10.xyzx
          store_structured g3.x, r5.x, l(0), r8.w
        endif 
      endif 
      sync_g_t
      ld_structured r8.w, r1.w, l(0), g3.xxxx
      ld_structured r9.y, r4.y, l(0), g3.xxxx
      mul r9.y, r9.y, l(0.140321)
      mad r8.w, r8.w, l(0.106289), r9.y
      ld_structured r9.y, r4.z, l(0), g3.xxxx
      mad r8.w, r9.y, l(0.165770), r8.w
      ld_structured r9.y, r4.w, l(0), g3.xxxx
      mad r8.w, r9.y, l(0.175240), r8.w
      ld_structured r9.y, r1.x, l(0), g3.xxxx
      mad r8.w, r9.y, l(0.165770), r8.w
      ld_structured r9.y, r1.y, l(0), g3.xxxx
      mad r8.w, r9.y, l(0.140321), r8.w
      ld_structured r9.y, r1.z, l(0), g3.xxxx
      mad r8.w, r9.y, l(0.106289), r8.w
      store_structured g4.x, r0.z, l(0), r8.w
      if_nz r2.w
        ld_structured r8.w, r0.w, l(0), g3.xxxx
        ld_structured r9.y, r5.y, l(0), g3.xxxx
        mul r9.y, r9.y, l(0.140321)
        mad r8.w, r8.w, l(0.106289), r9.y
        ld_structured r9.y, r5.z, l(0), g3.xxxx
        mad r8.w, r9.y, l(0.165770), r8.w
        ld_structured r9.y, r5.w, l(0), g3.xxxx
        mad r8.w, r9.y, l(0.175240), r8.w
        ld_structured r9.y, r6.x, l(0), g3.xxxx
        mad r8.w, r9.y, l(0.165770), r8.w
        ld_structured r9.y, r6.y, l(0), g3.xxxx
        mad r8.w, r9.y, l(0.140321), r8.w
        ld_structured r9.y, r6.z, l(0), g3.xxxx
        mad r8.w, r9.y, l(0.106289), r8.w
        store_structured g4.x, r0.x, l(0), r8.w
      endif 
      sync_g_t
      ld_structured r8.w, r0.z, l(0), g4.xxxx
      ld_structured r9.y, r7.x, l(0), g4.xxxx
      mul r9.y, r9.y, l(0.140321)
      mad r8.w, r8.w, l(0.106289), r9.y
      ld_structured r9.y, r7.y, l(0), g4.xxxx
      mad r8.w, r9.y, l(0.165770), r8.w
      ld_structured r9.y, r0.y, l(0), g4.xxxx
      mad r8.w, r9.y, l(0.175240), r8.w
      ld_structured r9.y, r7.z, l(0), g4.xxxx
      mad r8.w, r9.y, l(0.165770), r8.w
      ld_structured r9.y, r7.w, l(0), g4.xxxx
      mad r8.w, r9.y, l(0.140321), r8.w
      ld_structured r9.y, r2.x, l(0), g4.xxxx
      mad r8.w, r9.y, l(0.106289), r8.w
      mul r8.w, -r8.w, cb0[0].x
      mul r8.w, r8.w, l(1.442695)
      exp r8.w, r8.w
      ld_structured r10.x, r10.w, l(0), g0.xxxx
      ld_structured r10.y, r10.w, l(0), g1.xxxx
      ld_structured r10.z, r10.w, l(0), g2.xxxx
      add r9.x, r8.w, r9.x
      mad r10.xyz, r8.wwww, r10.xyzx, r11.xyzx
      bfi r11.xyzw, l(4, 4, 4, 4), l(0, 0, 0, 0), l(1, 1, 2, 2), r9.zwzw
      iadd r11.xyzw, r11.xyzw, vThreadIDInGroup.xxxx
      iadd r12.xyzw, r11.xyzw, l(3, 6, 3, 6)
      ld_structured r13.x, r12.x, l(0), g0.xxxx
      ld_structured r13.y, r12.x, l(0), g1.xxxx
      ld_structured r13.z, r12.x, l(0), g2.xxxx
      ld_structured r14.x, r3.x, l(0), g0.xxxx
      ld_structured r14.y, r3.x, l(0), g1.xxxx
      ld_structured r14.z, r3.x, l(0), g2.xxxx
      add r13.xyz, -r13.xyzx, r14.xyzx
      dp3 r8.w, r13.xyzx, r13.xyzx
      store_structured g3.x, r1.w, l(0), r8.w
      if_nz r2.z
        iadd r8.w, r11.x, l(19)
        ld_structured r13.x, r8.w, l(0), g0.xxxx
        ld_structured r13.y, r8.w, l(0), g1.xxxx
        ld_structured r13.z, r8.w, l(0), g2.xxxx
        ld_structured r14.x, r3.y, l(0), g0.xxxx
        ld_structured r14.y, r3.y, l(0), g1.xxxx
        ld_structured r14.z, r3.y, l(0), g2.xxxx
        add r13.xyz, -r13.xyzx, r14.xyzx
        dp3 r8.w, r13.xyzx, r13.xyzx
        store_structured g3.x, r4.x, l(0), r8.w
      endif 
      if_nz r2.w
        iadd r8.w, r11.x, l(771)
        ld_structured r13.x, r8.w, l(0), g0.xxxx
        ld_structured r13.y, r8.w, l(0), g1.xxxx
        ld_structured r13.z, r8.w, l(0), g2.xxxx
        ld_structured r14.x, r3.z, l(0), g0.xxxx
        ld_structured r14.y, r3.z, l(0), g1.xxxx
        ld_structured r14.z, r3.z, l(0), g2.xxxx
        add r13.xyz, -r13.xyzx, r14.xyzx
        dp3 r8.w, r13.xyzx, r13.xyzx
        store_structured g3.x, r0.w, l(0), r8.w
        if_nz r2.z
          iadd r8.w, r11.x, l(787)
          ld_structured r13.x, r8.w, l(0), g0.xxxx
          ld_structured r13.y, r8.w, l(0), g1.xxxx
          ld_structured r13.z, r8.w, l(0), g2.xxxx
          ld_structured r14.x, r3.w, l(0), g0.xxxx
          ld_structured r14.y, r3.w, l(0), g1.xxxx
          ld_structured r14.z, r3.w, l(0), g2.xxxx
          add r11.xyw, -r13.xyxz, r14.xyxz
          dp3 r8.w, r11.xywx, r11.xywx
          store_structured g3.x, r5.x, l(0), r8.w
        endif 
      endif 
      sync_g_t
      ld_structured r8.w, r1.w, l(0), g3.xxxx
      ld_structured r9.y, r4.y, l(0), g3.xxxx
      mul r9.y, r9.y, l(0.140321)
      mad r8.w, r8.w, l(0.106289), r9.y
      ld_structured r9.y, r4.z, l(0), g3.xxxx
      mad r8.w, r9.y, l(0.165770), r8.w
      ld_structured r9.y, r4.w, l(0), g3.xxxx
      mad r8.w, r9.y, l(0.175240), r8.w
      ld_structured r9.y, r1.x, l(0), g3.xxxx
      mad r8.w, r9.y, l(0.165770), r8.w
      ld_structured r9.y, r1.y, l(0), g3.xxxx
      mad r8.w, r9.y, l(0.140321), r8.w
      ld_structured r9.y, r1.z, l(0), g3.xxxx
      mad r8.w, r9.y, l(0.106289), r8.w
      store_structured g4.x, r0.z, l(0), r8.w
      if_nz r2.w
        ld_structured r8.w, r0.w, l(0), g3.xxxx
        ld_structured r9.y, r5.y, l(0), g3.xxxx
        mul r9.y, r9.y, l(0.140321)
        mad r8.w, r8.w, l(0.106289), r9.y
        ld_structured r9.y, r5.z, l(0), g3.xxxx
        mad r8.w, r9.y, l(0.165770), r8.w
        ld_structured r9.y, r5.w, l(0), g3.xxxx
        mad r8.w, r9.y, l(0.175240), r8.w
        ld_structured r9.y, r6.x, l(0), g3.xxxx
        mad r8.w, r9.y, l(0.165770), r8.w
        ld_structured r9.y, r6.y, l(0), g3.xxxx
        mad r8.w, r9.y, l(0.140321), r8.w
        ld_structured r9.y, r6.z, l(0), g3.xxxx
        mad r8.w, r9.y, l(0.106289), r8.w
        store_structured g4.x, r0.x, l(0), r8.w
      endif 
      sync_g_t
      ld_structured r8.w, r0.z, l(0), g4.xxxx
      ld_structured r9.y, r7.x, l(0), g4.xxxx
      mul r9.y, r9.y, l(0.140321)
      mad r8.w, r8.w, l(0.106289), r9.y
      ld_structured r9.y, r7.y, l(0), g4.xxxx
      mad r8.w, r9.y, l(0.165770), r8.w
      ld_structured r9.y, r0.y, l(0), g4.xxxx
      mad r8.w, r9.y, l(0.175240), r8.w
      ld_structured r9.y, r7.z, l(0), g4.xxxx
      mad r8.w, r9.y, l(0.165770), r8.w
      ld_structured r9.y, r7.w, l(0), g4.xxxx
      mad r8.w, r9.y, l(0.140321), r8.w
      ld_structured r9.y, r2.x, l(0), g4.xxxx
      mad r8.w, r9.y, l(0.106289), r8.w
      mul r8.w, -r8.w, cb0[0].x
      mul r8.w, r8.w, l(1.442695)
      exp r8.w, r8.w
      ld_structured r13.x, r12.y, l(0), g0.xxxx
      ld_structured r13.y, r12.y, l(0), g1.xxxx
      ld_structured r13.z, r12.y, l(0), g2.xxxx
      add r9.x, r8.w, r9.x
      mad r10.xyz, r8.wwww, r13.xyzx, r10.xyzx
      ld_structured r13.x, r12.z, l(0), g0.xxxx
      ld_structured r13.y, r12.z, l(0), g1.xxxx
      ld_structured r13.z, r12.z, l(0), g2.xxxx
      ld_structured r12.x, r3.x, l(0), g0.xxxx
      ld_structured r12.y, r3.x, l(0), g1.xxxx
      ld_structured r12.z, r3.x, l(0), g2.xxxx
      add r11.xyw, -r13.xyxz, r12.xyxz
      dp3 r8.w, r11.xywx, r11.xywx
      store_structured g3.x, r1.w, l(0), r8.w
      if_nz r2.z
        iadd r8.w, r11.z, l(19)
        ld_structured r12.x, r8.w, l(0), g0.xxxx
        ld_structured r12.y, r8.w, l(0), g1.xxxx
        ld_structured r12.z, r8.w, l(0), g2.xxxx
        ld_structured r13.x, r3.y, l(0), g0.xxxx
        ld_structured r13.y, r3.y, l(0), g1.xxxx
        ld_structured r13.z, r3.y, l(0), g2.xxxx
        add r11.xyw, -r12.xyxz, r13.xyxz
        dp3 r8.w, r11.xywx, r11.xywx
        store_structured g3.x, r4.x, l(0), r8.w
      endif 
      if_nz r2.w
        iadd r8.w, r11.z, l(771)
        ld_structured r12.x, r8.w, l(0), g0.xxxx
        ld_structured r12.y, r8.w, l(0), g1.xxxx
        ld_structured r12.z, r8.w, l(0), g2.xxxx
        ld_structured r13.x, r3.z, l(0), g0.xxxx
        ld_structured r13.y, r3.z, l(0), g1.xxxx
        ld_structured r13.z, r3.z, l(0), g2.xxxx
        add r11.xyw, -r12.xyxz, r13.xyxz
        dp3 r8.w, r11.xywx, r11.xywx
        store_structured g3.x, r0.w, l(0), r8.w
        if_nz r2.z
          iadd r8.w, r11.z, l(787)
          ld_structured r11.x, r8.w, l(0), g0.xxxx
          ld_structured r11.y, r8.w, l(0), g1.xxxx
          ld_structured r11.z, r8.w, l(0), g2.xxxx
          ld_structured r12.x, r3.w, l(0), g0.xxxx
          ld_structured r12.y, r3.w, l(0), g1.xxxx
          ld_structured r12.z, r3.w, l(0), g2.xxxx
          add r11.xyz, -r11.xyzx, r12.xyzx
          dp3 r8.w, r11.xyzx, r11.xyzx
          store_structured g3.x, r5.x, l(0), r8.w
        endif 
      endif 
      sync_g_t
      ld_structured r8.w, r1.w, l(0), g3.xxxx
      ld_structured r9.y, r4.y, l(0), g3.xxxx
      mul r9.y, r9.y, l(0.140321)
      mad r8.w, r8.w, l(0.106289), r9.y
      ld_structured r9.y, r4.z, l(0), g3.xxxx
      mad r8.w, r9.y, l(0.165770), r8.w
      ld_structured r9.y, r4.w, l(0), g3.xxxx
      mad r8.w, r9.y, l(0.175240), r8.w
      ld_structured r9.y, r1.x, l(0), g3.xxxx
      mad r8.w, r9.y, l(0.165770), r8.w
      ld_structured r9.y, r1.y, l(0), g3.xxxx
      mad r8.w, r9.y, l(0.140321), r8.w
      ld_structured r9.y, r1.z, l(0), g3.xxxx
      mad r8.w, r9.y, l(0.106289), r8.w
      store_structured g4.x, r0.z, l(0), r8.w
      if_nz r2.w
        ld_structured r8.w, r0.w, l(0), g3.xxxx
        ld_structured r9.y, r5.y, l(0), g3.xxxx
        mul r9.y, r9.y, l(0.140321)
        mad r8.w, r8.w, l(0.106289), r9.y
        ld_structured r9.y, r5.z, l(0), g3.xxxx
        mad r8.w, r9.y, l(0.165770), r8.w
        ld_structured r9.y, r5.w, l(0), g3.xxxx
        mad r8.w, r9.y, l(0.175240), r8.w
        ld_structured r9.y, r6.x, l(0), g3.xxxx
        mad r8.w, r9.y, l(0.165770), r8.w
        ld_structured r9.y, r6.y, l(0), g3.xxxx
        mad r8.w, r9.y, l(0.140321), r8.w
        ld_structured r9.y, r6.z, l(0), g3.xxxx
        mad r8.w, r9.y, l(0.106289), r8.w
        store_structured g4.x, r0.x, l(0), r8.w
      endif 
      sync_g_t
      ld_structured r8.w, r0.z, l(0), g4.xxxx
      ld_structured r9.y, r7.x, l(0), g4.xxxx
      mul r9.y, r9.y, l(0.140321)
      mad r8.w, r8.w, l(0.106289), r9.y
      ld_structured r9.y, r7.y, l(0), g4.xxxx
      mad r8.w, r9.y, l(0.165770), r8.w
      ld_structured r9.y, r0.y, l(0), g4.xxxx
      mad r8.w, r9.y, l(0.175240), r8.w
      ld_structured r9.y, r7.z, l(0), g4.xxxx
      mad r8.w, r9.y, l(0.165770), r8.w
      ld_structured r9.y, r7.w, l(0), g4.xxxx
      mad r8.w, r9.y, l(0.140321), r8.w
      ld_structured r9.y, r2.x, l(0), g4.xxxx
      mad r8.w, r9.y, l(0.106289), r8.w
      mul r8.w, -r8.w, cb0[0].x
      mul r8.w, r8.w, l(1.442695)
      exp r8.w, r8.w
      ld_structured r11.x, r12.w, l(0), g0.xxxx
      ld_structured r11.y, r12.w, l(0), g1.xxxx
      ld_structured r11.z, r12.w, l(0), g2.xxxx
      add r9.x, r8.w, r9.x
      mad r10.xyz, r8.wwww, r11.xyzx, r10.xyzx
      bfi r9.yz, l(0, 4, 4, 0), l(0, 0, 0, 0), l(0, 3, 3, 0), r9.zzwz
      iadd r9.yz, r9.yyzy, vThreadIDInGroup.xxxx
      iadd r9.zw, r9.yyyz, l(0, 0, 3, 6)
      ld_structured r11.x, r9.z, l(0), g0.xxxx
      ld_structured r11.y, r9.z, l(0), g1.xxxx
      ld_structured r11.z, r9.z, l(0), g2.xxxx
      ld_structured r12.x, r3.x, l(0), g0.xxxx
      ld_structured r12.y, r3.x, l(0), g1.xxxx
      ld_structured r12.z, r3.x, l(0), g2.xxxx
      add r11.xyz, -r11.xyzx, r12.xyzx
      dp3 r8.w, r11.xyzx, r11.xyzx
      store_structured g3.x, r1.w, l(0), r8.w
      if_nz r2.z
        iadd r8.w, r9.y, l(19)
        ld_structured r11.x, r8.w, l(0), g0.xxxx
        ld_structured r11.y, r8.w, l(0), g1.xxxx
        ld_structured r11.z, r8.w, l(0), g2.xxxx
        ld_structured r12.x, r3.y, l(0), g0.xxxx
        ld_structured r12.y, r3.y, l(0), g1.xxxx
        ld_structured r12.z, r3.y, l(0), g2.xxxx
        add r11.xyz, -r11.xyzx, r12.xyzx
        dp3 r8.w, r11.xyzx, r11.xyzx
        store_structured g3.x, r4.x, l(0), r8.w
      endif 
      if_nz r2.w
        iadd r8.w, r9.y, l(771)
        ld_structured r11.x, r8.w, l(0), g0.xxxx
        ld_structured r11.y, r8.w, l(0), g1.xxxx
        ld_structured r11.z, r8.w, l(0), g2.xxxx
        ld_structured r12.x, r3.z, l(0), g0.xxxx
        ld_structured r12.y, r3.z, l(0), g1.xxxx
        ld_structured r12.z, r3.z, l(0), g2.xxxx
        add r11.xyz, -r11.xyzx, r12.xyzx
        dp3 r8.w, r11.xyzx, r11.xyzx
        store_structured g3.x, r0.w, l(0), r8.w
        if_nz r2.z
          iadd r8.w, r9.y, l(787)
          ld_structured r11.x, r8.w, l(0), g0.xxxx
          ld_structured r11.y, r8.w, l(0), g1.xxxx
          ld_structured r11.z, r8.w, l(0), g2.xxxx
          ld_structured r12.x, r3.w, l(0), g0.xxxx
          ld_structured r12.y, r3.w, l(0), g1.xxxx
          ld_structured r12.z, r3.w, l(0), g2.xxxx
          add r11.xyz, -r11.xyzx, r12.xyzx
          dp3 r8.w, r11.xyzx, r11.xyzx
          store_structured g3.x, r5.x, l(0), r8.w
        endif 
      endif 
      sync_g_t
      ld_structured r8.w, r1.w, l(0), g3.xxxx
      ld_structured r9.y, r4.y, l(0), g3.xxxx
      mul r9.y, r9.y, l(0.140321)
      mad r8.w, r8.w, l(0.106289), r9.y
      ld_structured r9.y, r4.z, l(0), g3.xxxx
      mad r8.w, r9.y, l(0.165770), r8.w
      ld_structured r9.y, r4.w, l(0), g3.xxxx
      mad r8.w, r9.y, l(0.175240), r8.w
      ld_structured r9.y, r1.x, l(0), g3.xxxx
      mad r8.w, r9.y, l(0.165770), r8.w
      ld_structured r9.y, r1.y, l(0), g3.xxxx
      mad r8.w, r9.y, l(0.140321), r8.w
      ld_structured r9.y, r1.z, l(0), g3.xxxx
      mad r8.w, r9.y, l(0.106289), r8.w
      store_structured g4.x, r0.z, l(0), r8.w
      if_nz r2.w
        ld_structured r8.w, r0.w, l(0), g3.xxxx
        ld_structured r9.y, r5.y, l(0), g3.xxxx
        mul r9.y, r9.y, l(0.140321)
        mad r8.w, r8.w, l(0.106289), r9.y
        ld_structured r9.y, r5.z, l(0), g3.xxxx
        mad r8.w, r9.y, l(0.165770), r8.w
        ld_structured r9.y, r5.w, l(0), g3.xxxx
        mad r8.w, r9.y, l(0.175240), r8.w
        ld_structured r9.y, r6.x, l(0), g3.xxxx
        mad r8.w, r9.y, l(0.165770), r8.w
        ld_structured r9.y, r6.y, l(0), g3.xxxx
        mad r8.w, r9.y, l(0.140321), r8.w
        ld_structured r9.y, r6.z, l(0), g3.xxxx
        mad r8.w, r9.y, l(0.106289), r8.w
        store_structured g4.x, r0.x, l(0), r8.w
      endif 
      sync_g_t
      ld_structured r8.w, r0.z, l(0), g4.xxxx
      ld_structured r9.y, r7.x, l(0), g4.xxxx
      mul r9.y, r9.y, l(0.140321)
      mad r8.w, r8.w, l(0.106289), r9.y
      ld_structured r9.y, r7.y, l(0), g4.xxxx
      mad r8.w, r9.y, l(0.165770), r8.w
      ld_structured r9.y, r0.y, l(0), g4.xxxx
      mad r8.w, r9.y, l(0.175240), r8.w
      ld_structured r9.y, r7.z, l(0), g4.xxxx
      mad r8.w, r9.y, l(0.165770), r8.w
      ld_structured r9.y, r7.w, l(0), g4.xxxx
      mad r8.w, r9.y, l(0.140321), r8.w
      ld_structured r9.y, r2.x, l(0), g4.xxxx
      mad r8.w, r9.y, l(0.106289), r8.w
      mul r8.w, -r8.w, cb0[0].x
      mul r8.w, r8.w, l(1.442695)
      exp r8.w, r8.w
      ld_structured r11.x, r9.w, l(0), g0.xxxx
      ld_structured r11.y, r9.w, l(0), g1.xxxx
      ld_structured r11.z, r9.w, l(0), g2.xxxx
      add r6.w, r8.w, r9.x
      mad r8.xyz, r8.wwww, r11.xyzx, r10.xyzx
      iadd r2.y, r2.y, l(1)
    endloop 
    ult r0.xy, vThreadID.xyxx, cb0[0].yzyy
    and r0.x, r0.y, r0.x
    if_nz r0.x
      div r0.xyz, r8.xyzx, r6.wwww
      mad r0.xyz, r0.xyzx, l(255.000000, 255.000000, 255.000000, 0.000000), l(0.500000, 0.500000, 0.500000, 0.000000)
      max r0.xyz, r0.xyzx, l(0.000000, 0.000000, 0.000000, 0.000000)
      min r0.xyz, r0.xyzx, l(255.000000, 255.000000, 255.000000, 0.000000)
      ftou r0.xyz, r0.xyzx
      bfi r0.y, l(24), l(8), r0.y, r0.z
      bfi r0.x, l(16), l(16), r0.x, r0.y
      imad r0.y, vThreadID.y, cb0[0].y, vThreadID.x
      ishl r0.y, r0.y, l(2)
      store_raw u0.x, r0.y, r0.x
    endif 
    ret 
    // Approximately 799 instruction slots used
    
    The ISA uses 59 GPRs and only 64 scratch registers.

    This version of GPUSA is reporting the August 09 DX SDK. Catalyst is 9.12.

    Jawed
     
  10. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    Hmm, maybe it doesn't like the double nesting... have you tried reforming that as a single loop with address logic to see if that goes through? That's not to say that it shouldn't go through in general, but it might be a bug.

    [Edit] Yup, compiles fine if you reformulate it like so:
    Code:
    uint kernel_size = kernel_half * 2 + 1;
    [loop] for(uint ij = 0; ij < kernel_size * kernel_size; ++ij) {
        // i and j are signed: they run from -kernel_half to +kernel_half
        int i = (int)(ij / kernel_size) - kernel_half;
        int j = (int)(ij % kernel_size) - kernel_half;
        ...
    
    Sorry if there are bugs... didn't test the code, but you get the idea. Also the above code can be made more efficient.
     
    #50 Andrew Lauritzen, May 4, 2010
    Last edited by a moderator: May 4, 2010
  11. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    As an experiment I managed to set things in GPUSA so that it would take 7 minutes to fail compilation with a similar message:

    Code:
    ERROR 0:109: (109,4): warning X3574: synchronization operations cannot be used in varying flow control, forcing loop to unroll
    ERROR 0:74: error X3511: Unable to unroll loop, loop does not appear to terminate in a timely manner (148 iterations), use the [unroll(n)] attribute to force an exact higher number
    
    I changed:

    Code:
    cbuffer cb0
    {
     float g_sigma;
     int2 g_imagesize;
     uint g_kernel_half;
    }
    Code:
    #undef kernel_half
    [numthreads(num_threads_x, num_threads_y, 1)]
    void CSMain(uint3 Gid : SV_DispatchThreadID, uint3 Tid : SV_GroupThreadID)
    {
     int i, j, l;
     uint kernel_half ;
     float3 total_color = float3(0.0f, 0.0f, 0.0f);
     float total_weight = 0.0f;
     float3 c;
     kernel_half = g_kernel_half;
    and removed the [unroll] pragmas. Then I set GPUSA's option Flow Control = Prefer.

    So I suspect your compilation is taking ages because the DX compiler is being called with "Flow Control = Prefer".

    Jawed
     
  12. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,785
    Likes Received:
    173
    Location:
    Taiwan
    This is a good idea. Apparently this bug wasn't fixed in either the August 2009 SDK or the February 2010 SDK. I'll try it to see how the performance goes (the unrolled version takes around 300 ~ 400 ms to run, which is several times slower than the equivalent OpenCL version).
     
  13. ryta1203

    Newcomer

    Joined:
    Sep 3, 2009
    Messages:
    40
    Likes Received:
    0
    1. Ok, I see what you mean about the benchmarks and I agree 100%; I imagine that the benchmarks mentioned probably run horribly on AMD GPUs.

    2. If you are a single person running a single kernel then ok. But there are a TON of ways to optimize the GPU when used in different environments.

    3. To the person who was talking about generic code: there are currently really only two GPU makers competing... Nvidia and AMD. I agree that Nvidia makes it easier to code, but if you are at all interested in HPC and you expect generic code (even for CPUs) to run efficiently on different archs just by recompiling, then I'm not sure you are in HPC. For example, it's very difficult for a compiler to properly vectorize code.

    4. Heterogeneous systems... yes, hence OpenCL (psst, it's not just for GPUs). Also, hence the async transfers and async kernel calls (a small sketch of that pattern follows this list).

    5. Has anyone looked at the SiSoftware Sandra OpenCL benchmarks? I'm just curious.
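    A small sketch of the async pattern from point 4, assuming standard OpenCL event calls (the names queue, buf, kernel, gws, etc. are illustrative and assumed to be set up earlier):

    Code:
    /* Non-blocking upload: the host is free to do other work while the
     * transfer is in flight, and the kernel waits on the event rather
     * than on the host. */
    cl_event upload;
    clEnqueueWriteBuffer(queue, buf, CL_FALSE, 0, bytes, host_in,
                         0, NULL, &upload);
    /* ... other host work can overlap the transfer here ... */
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, NULL,
                           1, &upload, NULL);   /* ordered after the upload */
    clFlush(queue);   /* submit without blocking the host */
    clReleaseEvent(upload);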
     
    #53 ryta1203, May 5, 2010
    Last edited by a moderator: May 5, 2010
  14. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,785
    Likes Received:
    173
    Location:
    Taiwan
    By using a single-layer loop, the code is now compilable as CS:

    Code:
    // NLM shader
    // Copyright(c) 2009 Ping-Che Chen 
    
    #define num_threads_x 16
    #define num_threads_y 16
    #define kernel_half 3
    
    
    Texture2D Input : register(t0);
    RWByteAddressBuffer Result : register(u0);
    
    
    cbuffer cb0
    {
    	float g_sigma;
    	int2 g_imagesize;
    }
    
    
    static const float gaussian[7] = {
    	0.1062888f, 0.1403214f, 0.1657702f, 0.1752401f, 0.1657702f, 0.1403214f, 0.1062888f
    };
    
    
    groupshared float shared_r[(num_threads_y + kernel_half * 4) * 48];
    groupshared float shared_g[(num_threads_y + kernel_half * 4) * 48];
    groupshared float shared_b[(num_threads_y + kernel_half * 4) * 48];
    groupshared float dots[(num_threads_y + kernel_half * 2) * 48];
    groupshared float dots2[(num_threads_y + kernel_half * 2) * 16];
    
    
    [numthreads(num_threads_x, num_threads_y, 1)]
    void CSMain(uint3 Gid : SV_DispatchThreadID, uint3 Tid : SV_GroupThreadID)
    {
    	int i, j, k, l;
    
    	float3 total_color = float3(0.0f, 0.0f, 0.0f);
    	float total_weight = 0.0f;
    	float3 c;
    
    	for(i = Tid.y; i < num_threads_y + kernel_half * 4; i += num_threads_y) {
    		for(j = Tid.x; j < num_threads_x + kernel_half * 4; j += num_threads_x) {
    			c = Input.Load(clamp(int3(Gid.x + j - Tid.x - kernel_half * 2, Gid.y + i - Tid.y - kernel_half * 2, 0), int3(0, 0, 0), int3(g_imagesize.x, g_imagesize.y, 0)));
    			shared_r[i * 48 + j] = c.r;
    			shared_g[i * 48 + j] = c.g;
    			shared_b[i * 48 + j] = c.b;
    		}
    	}
    		
    	GroupMemoryBarrierWithGroupSync();
    
    	i = -kernel_half;
    	j = -kernel_half;
    	[loop]
    	for(k = 0; k < (kernel_half * 2 + 1) * (kernel_half * 2 + 1); k++) {
    		float total;
    		float3 c1, c2, cd, cp;
    
    		int y1 = (i + Tid.y + kernel_half) * 48 + j + Tid.x + kernel_half;
    		int y2 = (Tid.y + kernel_half) * 48 + Tid.x + kernel_half;			
    		
    		c1 = float3(shared_r[y1], shared_g[y1], shared_b[y1]);
    		c2 = float3(shared_r[y2], shared_g[y2], shared_b[y2]);
    		cd = (c2 - c1);
    		dots[Tid.y * 48 + Tid.x] = dot(cd, cd);
    		
    		if(Tid.x < kernel_half * 2) {
    			c1 = float3(shared_r[y1 + num_threads_x], shared_g[y1 + num_threads_x], shared_b[y1 + num_threads_x]);
    			c2 = float3(shared_r[y2 + num_threads_x], shared_g[y2 + num_threads_x], shared_b[y2 + num_threads_x]);
    			cd = (c2 - c1);
    			dots[Tid.y * 48 + Tid.x + num_threads_x] = dot(cd, cd);
    		}
    		
    		if(Tid.y < kernel_half * 2) {
    			c1 = float3(shared_r[y1 + num_threads_y * 48], shared_g[y1 + num_threads_y * 48], shared_b[y1 + num_threads_y * 48]);
    			c2 = float3(shared_r[y2 + num_threads_y * 48], shared_g[y2 + num_threads_y * 48], shared_b[y2 + num_threads_y * 48]);
    			cd = (c2 - c1);
    			dots[(Tid.y + num_threads_y) * 48 + Tid.x] = dot(cd, cd);
    			
    			if(Tid.x < kernel_half * 2) {
    				c1 = float3(shared_r[y1 + num_threads_y * 48 + num_threads_x], shared_g[y1 + num_threads_y * 48 + num_threads_x], shared_b[y1 + num_threads_y * 48 + num_threads_x]);
    				c2 = float3(shared_r[y2 + num_threads_y * 48 + num_threads_x], shared_g[y2 + num_threads_y * 48 + num_threads_x], shared_b[y2 + num_threads_y * 48 + num_threads_x]);
    				cd = (c2 - c1);
    				dots[(Tid.y + num_threads_y) * 48 + Tid.x + num_threads_x] = dot(cd, cd);
    			}
    		}
    		
    		GroupMemoryBarrierWithGroupSync();
    				
    		float d = 0.0f;		
    		[unroll]
    		for(l = 0; l <= kernel_half * 2; l++) {
    			d += dots[Tid.y * 48 + Tid.x + l] * gaussian[l];
    		}
    		
    		dots2[Tid.y * num_threads_x + Tid.x] = d;
    		
    		if(Tid.y < kernel_half * 2) {
    			d = 0.0f;
    			[unroll]
    			for(l = 0; l <= kernel_half * 2; l++) {
    				d += dots[(Tid.y + num_threads_y) * 48 + Tid.x + l] * gaussian[l];
    			}
    			
    			dots2[(Tid.y + num_threads_y) * num_threads_x + Tid.x] = d;
    		}
    		
    		GroupMemoryBarrierWithGroupSync();
    
    		d = 0.0f;
    		
    		[unroll]
    		for(l = 0; l <= kernel_half * 2; l++) {
    			d += dots2[(Tid.y + l) * num_threads_x + Tid.x] * gaussian[l];
    		}
    
    		total = exp(-d * g_sigma);
    		
    		cp = float3(shared_r[(i + Tid.y + kernel_half * 2) * 48 + j + Tid.x + kernel_half * 2],
    			shared_g[(i + Tid.y + kernel_half * 2) * 48 + j + Tid.x + kernel_half * 2],
    			shared_b[(i + Tid.y + kernel_half * 2) * 48 + j + Tid.x + kernel_half * 2]);
    			
    		total_weight += total;
    		total_color += total * cp;
    		
    		j++;
    		if(j > kernel_half) {
    			i++;
    			j = -kernel_half;
    		}
    	}
    
    	float3 vColor = total_color / total_weight;
    
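    	// Convert to 8-bit and pack as 0x00RRGGBB before storing to the byte-addressable output.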
    	if(Gid.x < g_imagesize.x && Gid.y < g_imagesize.y) {
    		float3 vc = vColor * 255 + 0.5;
    		vc = clamp(vc, 0, 255);
    		uint x = vc.x;
    		uint y = vc.y;
    		uint z = vc.z;
    		uint c = z + (y << 8) + (x << 16);
    		Result.Store((Gid.y * g_imagesize.x + Gid.x) * 4, c);
    	}
    }
    
    This now runs properly and takes only around 66 ms on my Radeon 5850. It's slightly faster than the roughly equivalent OpenCL version, which takes around 77 ms. Unfortunately, since the GTX 285 does not support CS 5.0, I can't run it on that card; the GTX 285 is still faster in OpenCL, though (around 39 ms).
     
  15. ryta1203

    Newcomer

    Joined:
    Sep 3, 2009
    Messages:
    40
    Likes Received:
    0
    Looking at the code, one would certainly expect it to run faster on Nvidia hardware.
     
  16. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    pcchen, what happens if you remove the 3 [unroll] pragmas?

    Using GPUSA, the code as posted uses 36 GPRs; removing those pragmas brings it down to 19 GPRs. With 36 GPRs you'll get one workgroup (of 256 work items = 4 hardware threads) per SIMD, as only 6 or 7 (hard to be sure) hardware threads are allocatable.

    With 19 GPRs you will get 3 workgroups per SIMD (12/13 hardware threads), which tallies with a budget of roughly 256 GPRs per strand: 256/36 ≈ 7 and 256/19 ≈ 13 hardware threads. That looks like it will still fit within the 32 KB of local data share (LDS) available per SIMD.

    Jawed
     
  17. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,785
    Likes Received:
    173
    Location:
    Taiwan
    Removing [unroll] or adding [loop] makes it a little slower, from 63~66 ms to 68~71 ms. The loops are small, so the relative overhead of not unrolling them is probably too large.
    It's possible to "vectorize" this kernel by making it process 2 pixels at once, but that would probably need more registers.
    I also tried unrolling the small loops by 2 to make them easier for the compiler to vectorize (like this):

    Code:
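    		// two independent accumulators shorten the serial dependence chain and give the
    		// VLIW lanes paired MADs; the tail term below covers the odd (2*kernel_half+1) element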
    		float d1 = 0.0f, d2 = 0.0f;
    		[unroll]
    		for(l = 0; l < kernel_half; l++) {
    			d1 += dots[Tid.y * 48 + Tid.x + l] * gaussian[l];
    			d2 += dots[Tid.y * 48 + Tid.x + l + kernel_half] * gaussian[l + kernel_half];
    		}
    		
    		dots2[Tid.y * num_threads_x + Tid.x] = d1 + d2 + dots[Tid.y * 48 + Tid.x + kernel_half * 2] * gaussian[kernel_half * 2];
    
    but the effect is quite small.
     
  18. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    I haven't had a chance to take a detailed look at the code but a few questions/suggestions...

    1) Why do you split up the RGB arrays in local memory? Just storing them as a float3 array would simplify your code and reduce some of the addressing math. Gather/scatter in local memory runs at full speed, so there should be no need for SoA-style storage in this case (see the sketch after point 6).

    2) I still need to get my head around the overall execution structure of the kernel, but it seems like a few things could be reformulated to be a bit more efficient. For instance, most of the branching is on small blocks (of 6, for instance), and the branch on Tid.y is particularly bad, as warps are typically laid out in row-major order. There is usually a better way to lay out the problem domain to avoid this; at the very least, flip the nesting order (i.e. nest the Tid.y branch inside the Tid.x branch instead of the other way around). A slight reformulation of the execution domain could probably help a lot here, but I need to draw out how your data gets mapped into the "dots" arrays to see whether there's a more efficient arrangement.

    3) Rather than using a byte-addressable buffer as the output, try a standard RWTexture2D<float4> bound to an RGBA8_UNORM texture. You can't read from these UAVs currently, but you only need to store here, so you can avoid the format conversion math (and simplify the code) and make use of the ROPs (also shown in the sketch below).

    4) You don't need the "clamp()" stuff on your initial texture load at the start. OOB reads from global arrays return 0, which should be safe for your kernel by my brief reading.

    5) The initial read is a bit awkward in that it is indexed on negative thread IDs, which will disrupt any coalescing of the global memory reads. It's probably worth reformulating it to do the swap when storing into local memory instead, if necessary. Always think of reading/writing global memory in blocks that look roughly linear in your thread IDs.

    6) I get minor twitches seeing "48"s scattered all around the code :) I assume this is related to num_threads_x*3? Changing to float3 arrays (and maybe even 2D arrays) in local memory should make this unnecessary and perhaps a bit faster.
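    To make points 1 and 3 concrete, here's a rough sketch of the kind of thing I mean (the names and the 16x16 group size are my guesses, not your actual declarations):

    Code:
    Texture2D<float4> Input;
    RWTexture2D<float4> ResultTex;        // bound to an RGBA8_UNORM target; the store converts for free
    
    groupshared float3 shared_c[48 * 48]; // one float3 tile instead of shared_r/g/b (27 KB, under the 32 KB limit)
    
    [numthreads(16, 16, 1)]
    void FilterSketch(uint3 DTid : SV_DispatchThreadID, uint3 Tid : SV_GroupThreadID)
    {
    	// one address computation per texel instead of three per-channel ones
    	shared_c[Tid.y * 48 + Tid.x] = Input.Load(int3(DTid.xy, 0)).rgb;
    	GroupMemoryBarrierWithGroupSync();
    
    	// ... filtering exactly as before, but reading float3s directly ...
    	float3 filtered = shared_c[Tid.y * 48 + Tid.x];
    
    	// no manual *255 / clamp / shift packing, and the ROPs handle the write
    	ResultTex[DTid.xy] = float4(filtered, 1.0f);
    }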

    Ironically, to write efficient GPU code you often have to think first about how you'd write the more conventional SIMD-style code, and then about how to morph that into HLSL that will generate it on the GPU, rather than writing independent "thread"-like code. It's usually best to look at the data access patterns, map those to SIMD execution and data structures, and then express that in HLSL. The multi-dimensional thread indices, domain "blocks", etc. often just get in the way and provide too many different ways to express an algorithm, most of which map horribly inefficiently to the hardware.

    There's really no way that the 5850 should be losing to the 285, particularly by that sort of margin. I expect restructuring your code a bit will not only close the gap but improve the performance on the 285 as well.
     
    #58 Andrew Lauritzen, May 5, 2010
    Last edited by a moderator: May 5, 2010
  19. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Looking at the ISA generated by GPUSA for the two scenarios, the code is almost exactly the same :shock:

    Both end up having those 3 loops unrolled. It seems the pragmas affect the D3D compiler, but either way ATI's compiler produces similarly structured ISA.

    It seems the main effect of [unroll] is that the ISA contains more "pre-computed" stuff before the main loop gets underway. The [unroll] version appears to be ~15 ALU cycles faster per iteration. (Of course, both of GPUSA's compilers are way out of date...)

    I suppose the other thing to play with is LDS banking. I wonder if the mod-16 (%16) addressing you're using conflicts with the mod-32 (%32) banking the hardware prefers? %16 suits the GTX 285.

    Yeah, which raises the question of multi-pixel processing combined with the [unroll] change (the latter saving GPRs).

    That change improves the main loop to 70 cycles and 11 clauses, versus 97 cycles and 14 clauses. But I suppose that's worth <10 ms, judging from the ~15-cycle saving of [unroll] (which has the same clause count). EDIT: I did it wrong, ignore all this.

    Jawed
     
    #59 Jawed, May 5, 2010
    Last edited by a moderator: May 5, 2010
  20. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,785
    Likes Received:
    173
    Location:
    Taiwan
    1) Using float3 seems to cause LDS bank conflicts. That's also why there are "48"s everywhere :p

    2) This is difficult because the kernel has to access overlapping data (i.e. to do 16x16 NLM it needs (16+12)*(16+12) pixels).

    3) Possible, but since each pixel is only written once, this probably doesn't matter much.

    4) Again, this probably doesn't matter much, since it's only done at most 4 times per pixel.

    5) That's why I read it from a texture :) In the OpenCL version, there is a "pre-arrange" kernel which shifts the data a few pixels so that the main kernel's reads are all coalesced.

    6) As explained above, this is to avoid LDS bank conflicts (at least in the OpenCL version).

    Well, the ALU packing rate is not very good. I have tried many different approaches in the OpenCL version (the original one used a uchar4 array instead of three float arrays), and the CS version is modified from the best-performing one.

    [EDIT] I did a quick modification from three float arrays to one float3 array, and it makes no difference in performance. Apparently CS has its own way of arranging the banks :)
     