GPCBenchmark - An OpenCL General Purpose Computing benchmark

Discussion in 'GPGPU Technology & Programming' started by Arnold Beckenbauer, Apr 30, 2010.

  1. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,774
    Likes Received:
    156
    Location:
    Taiwan
    The *48 is already for the 32 LDS banks of Evergreen (this was written a long time ago, so I hope my memory serves :p ). There are three arrays to consider: the three shared_r/g/b arrays, the dots array, and the dots2 array.

    The shared_r/g/b arrays need to contain (16 + 12) * (16 + 12) pixels. Basically each thread only accesses one pixel (plus the border pixels, i.e. the +12 ones). However, with a (16 + 12) * (16 + 12) layout there will be bank conflicts, because it'll look like this:

    Thread 0 ~ thread 15 accessing 0 ~ 15: that's OK.
    Thread 16 ~ thread 31 (Tid.y = 1) accessing 28 ~ 43: that's NOT OK (those are banks 28 ~ 31 and banks 0 ~ 11, which collide with the first group). There will be bank conflicts.
    And so on.

    To avoid bank conflicts, it's important that threads 16 ~ 31 access banks 16 ~ 31. That's where the * 48 comes in. When the array is (16 + 12) * 48, it looks like:

    Thread 0 ~ thread 15 accessing 0 ~ 15.
    Thread 16 ~ thread 31 (Tid.y = 1) accessing 48 ~ 63, that's bank 16 ~ 31. No bank conflicts.
    And so on.

    The same goes for the dots array, though it only needs to be larger than (16+6) * (16+6).

    The dots2 array is a bit different. It's smaller because it only needs to be (16 + 6) * 16. But * 16 is actually ok because there is no bank conflict:

    Thread 0 ~ thread 15 accessing 0 ~ 15.
    Thread 16 ~ thread 31 (Tid.y = 1) accessing 16 ~ 31. No bank conflicts.
    And so on.

    That's why there are ugly 48s everywhere.

    In the OpenCL version, it's written as:

    #define PACK_OFFSET1 (48 - THREAD_X - KERNEL_OFFSET * 2)

    which is a bit cleaner, but it's basically the same as using * 48 everywhere (THREAD_X + KERNEL_OFFSET * 2 can't be larger than 48 here; basically one needs to find an n such that 32*n + 16 >= THREAD_X + KERNEL_OFFSET * 2).
     
  2. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    Yeah, but I still feel like there might be a different way to arrange the kernel to better capture the coherency. I'll think about it a bit more later when I have time to sit down with the code.

    Yup, true, but simpler is always better :) I haven't used byte-addressed buffers much, so they may or may not carry a performance penalty. Seems like an RWTexture2D fits your usage perfectly in any case.

    Might be true, but have you benchmarked which part of the kernel is hurting the most? You can do a lot of stuff in local memory for every global memory access.

    HLSL generates the same instruction for "Texture.Load" as "Texture/Buffer[index]", so it's not clear that this is actually going through the sampler/cache. It may be, but you could try using a Sample instruction w/ point sampling to be sure.

    Fair enough, although I wouldn't assume that OpenCL code maps directly to DC in terms of trade-offs. Not only is there a compiler from a 3rd party involved in DC (for good or for bad), but there are a lot of implementation choices left up in the air by both specs. On ATI I've had the best luck writing and optimizing directly in DC compared to OpenCL, but presumably both will mature further.
     
  3. ryta1203

    Newcomer

    Joined:
    Sep 3, 2009
    Messages:
    40
    Likes Received:
    0
    For AMD, you need to vectorize and then reduce control flow... it looks like you have a lot of control flow. The CAL compiler doesn't allow packing across CF clauses, so get rid of them; this will increase packing (of course).

    Also, if registers are a problem, is it possible to split your code into two kernels? This "might" increase performance.
     
  4. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    I didn't benchmark the CS version, but in the OpenCL version the "pre-arrangement" kernel, which has quite a few non-coalesced accesses, takes less than 1 ms in kernel time. So it looks like the central loop is the time killer.
    I did a simple test by removing the texture loads in the CS version and filling the shared array with some random data (like i * 48 + j or something). It didn't change the execution time, though.

    I don't have much experience writing for DirectCompute, though. Most of my experience is with CUDA and OpenCL. On ATI I agree DirectCompute looks to perform better than OpenCL, at least for now :)
     
  5. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    I don't know if registers are a problem here, though. To me it looks like it shouldn't be using too many registers.
    A simple vectorization is to make each thread handle two (or more) pixels at once. This can be done by splitting the image in two. Control flow is a harder problem, though.

    [EDIT] One way to get rid of those control flows is to make the dots and dots2 arrays large enough and perform redundant computations. I don't know if it's going to be better, but I'll try it later.
     
  6. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Thanks for the explanation. I also got the workgroup total shared memory wrong earlier, so I prolly need to go to bed. It's 21760 bytes which means there can only be 1 workgroup per SIMD, so reducing GPR count would have made no difference to latency-hiding.

    Seems you have to roll your own packed format/operations with CS5 in order to match OpenCL's support for uchar4. But that would cut the count of LDS operations substantially, as well as the amount of space used, therefore allowing >1 workgroup to share each SIMD (provided no unrolling is used).

    Since the earlier attempt at reducing GPRs couldn't directly affect latency-hiding, perhaps packing would make the key difference.

    Jawed
     
  7. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Yeah, now this looks like a very good idea :)
    On GTX 285 bank conflicts are not a problem (it has 16 banks and the threads are arranged 16x16), and shared memory usage is well under control. I compared uchar4 and three floats in OpenCL, and they didn't make any difference on GTX 285 (IIRC three floats were a little slower, but I could be wrong since it was a long time ago).

    Now I've made a "custom" uchar4 version for DirectCompute, and it performs well (it also contains the "no flow control" modification, but that modification has only a small effect, about 3 ms, probably due to the redundant computation). It now takes around 53 ms to run. The code is as follows:

    Code:
    // NLM shader
    // Copyright(c) 2009 Ping-Che Chen 
    
    #define num_threads_x 16
    #define num_threads_y 16
    #define kernel_half 3
    
    
    Texture2D Input : register(t0);
    SamplerState PointSampler : register(s0);
    RWByteAddressBuffer Result : register(u0);
    
    
    cbuffer cb0
    {
    	float g_sigma;
    	int2 g_imagesize;
    }
    
    
    static const float gaussian[7] = {
    	0.1062888f, 0.1403214f, 0.1657702f, 0.1752401f, 0.1657702f, 0.1403214f, 0.1062888f
    };
    
    
    groupshared uint sharedc[(num_threads_y * 2) * 48];
    groupshared float dots[(num_threads_y * 2) * 48];
    groupshared float dots2[(num_threads_y * 2) * 16];
    
    
    float3 transform_color(uint c)
    {
    	int r = (c >> 16) & 0xff;
    	int g = (c >> 8) & 0xff;
    	int b = c & 0xff;
    	return float3(r, g, b);
    }
    
    
    [numthreads(num_threads_x, num_threads_y, 1)]
    void CSMain(uint3 Gid : SV_DispatchThreadID, uint3 Tid : SV_GroupThreadID)
    {
    	int i, j, k, l;
    
    	float3 total_color = float3(0.0f, 0.0f, 0.0f);
    	float total_weight = 0.0f;
    	float3 c;
    
    	for(i = Tid.y; i < num_threads_y * 2; i += num_threads_y) {	
    		for(j = Tid.x; j < num_threads_x * 2; j += num_threads_x) {	
    			c = Input.Load(clamp(int3(Gid.x + j - Tid.x - kernel_half * 2, Gid.y + i - Tid.y - kernel_half * 2, 0), int3(0, 0, 0), int3(g_imagesize.x, g_imagesize.y, 0)));
    g_imagesize.x, (Gid.y + i - Tid.y - kernel_half * 2 + 0.5f) / g_imagesize.y));
    			c *= 255.0f;
    			int r = c.r;
    			int g = c.g;
    			int b = c.b;
    			sharedc[i * 48 + j] = (r << 16) + (g << 8) + b;
    		}
    	}
    
    	GroupMemoryBarrierWithGroupSync();
    
    	i = -kernel_half;
    	j = -kernel_half;
    	[loop]
    	for(k = 0; k < (kernel_half * 2 + 1) * (kernel_half * 2 + 1); k++) {
    		float total;
    		float3 c1, c2, cd, cp;
    
    		int y1 = (i + Tid.y + kernel_half) * 48 + j + Tid.x + kernel_half;
    		int y2 = (Tid.y + kernel_half) * 48 + Tid.x + kernel_half;			
    		
    		c1 = transform_color(sharedc[y1]);
    		c2 = transform_color(sharedc[y2]);
    		cd = (c2 - c1);
    		dots[Tid.y * 48 + Tid.x] = dot(cd, cd);
    		
    		c1 = transform_color(sharedc[y1 + num_threads_x]);
    		c2 = transform_color(sharedc[y2 + num_threads_x]);
    		cd = (c2 - c1);
    		dots[Tid.y * 48 + Tid.x + num_threads_x] = dot(cd, cd);		// no matter, it's large enough
    		
    		c1 = transform_color(sharedc[y1 + num_threads_y * 48]);
    		c2 = transform_color(sharedc[y2 + num_threads_y * 48]);
    		cd = (c2 - c1);
    		dots[(Tid.y + num_threads_y) * 48 + Tid.x] = dot(cd, cd);	// no matter, it's large enough
    			
    		c1 = transform_color(sharedc[y1 + num_threads_y * 48 + num_threads_x]);
    		c2 = transform_color(sharedc[y2 + num_threads_y * 48 + num_threads_x]);
    		cd = (c2 - c1);
    		dots[(Tid.y + num_threads_y) * 48 + Tid.x + num_threads_x] = dot(cd, cd);
    		
    		GroupMemoryBarrierWithGroupSync();
    				
    		float d1 = 0.0f, d2 = 0.0f;
    		[unroll]
    		for(l = 0; l < kernel_half; l++) {
    			d1 += dots[Tid.y * 48 + Tid.x + l] * gaussian[l];
    			d2 += dots[Tid.y * 48 + Tid.x + l + kernel_half] * gaussian[l + kernel_half];
    		}
    		
    		dots2[Tid.y * num_threads_x + Tid.x] = d1 + d2 + dots[Tid.y * 48 + Tid.x + kernel_half * 2] * gaussian[kernel_half * 2];
    
    		d1 = 0.0f;
    		d2 = 0.0f;
    		[unroll]
    		for(l = 0; l < kernel_half; l++) {
    			d1 += dots[(Tid.y + num_threads_y) * 48 + Tid.x + l] * gaussian[l];
    			d2 += dots[(Tid.y + num_threads_y) * 48 + Tid.x + l + kernel_half] * gaussian[l + kernel_half];
    		}
    			
    		dots2[(Tid.y + num_threads_y) * num_threads_x + Tid.x] = d1 + d2 + dots[(Tid.y + num_threads_y) * 48 + Tid.x + kernel_half * 2] * gaussian[kernel_half * 2];
    		
    		GroupMemoryBarrierWithGroupSync();
    
    		d1 = 0.0f;
    		d2 = 0.0f;
    		
    		[unroll]
    		for(l = 0; l < kernel_half; l++) {
    			d1 += dots2[(Tid.y + l) * num_threads_x + Tid.x] * gaussian[l];
    			d2 += dots2[(Tid.y + l + kernel_half) * num_threads_x + Tid.x] * gaussian[l + kernel_half];
    		}
    		
    		d1 += dots2[(Tid.y + kernel_half * 2) * num_threads_x + Tid.x] * gaussian[kernel_half * 2];
    
    		total = exp(-(d1 + d2) * g_sigma * (1 / 255.0f) * (1 / 255.0f));
    		
    		cp = transform_color(sharedc[(i + Tid.y + kernel_half * 2) * 48 + j + Tid.x + kernel_half * 2]);
    			
    		total_weight += total;
    		total_color += total * cp;
    		
    		j++;
    		if(j > kernel_half) {
    			i++;
    			j = -kernel_half;
    		}
    	}
    
    	float3 vColor = total_color / total_weight;
    
    	if(Gid.x < g_imagesize.x && Gid.y < g_imagesize.y) {
    		float3 vc = vColor + 0.5;
    		vc = clamp(vc, 0, 255);
    		uint x = vc.x;
    		uint y = vc.y;
    		uint z = vc.z;
    		uint c = z + (y << 8) + (x << 16);
    		Result.Store((Gid.y * g_imagesize.x + Gid.x) * 4, c);
    	}
    }
    
    I'll post the host code later. I think it's pretty fun to play with DirectCompute, as it's quite different from OpenCL or CUDA. :)

    [EDIT] I just thought about something: with these redundant computations in place, it's now much easier to use the symmetry of the dot operations (because dot(a-b, a-b) is the same as dot(b-a, b-a)). This has the potential to cut the computation in half.
     
  8. ryta1203

    Newcomer


    You have a lot of "if" branches; these can be eliminated (if the compiler doesn't already do so) by turning control flow into data flow.
     
  9. Jawed

    Legend

    Nice step forwards. I noticed the following rogue line in the code you've pasted:

    g_imagesize.x, (Gid.y + i - Tid.y - kernel_half * 2 + 0.5f) / g_imagesize.y));

    which I've just ignored.

    The main loop is considerably shorter now, only 73 cycles and only 6 clauses (versus 97 cycles and 14 clauses of the first-posted code). (All this subject to my using way out-of-date D3D and ATI compilers that are bundled inside GPUSA.)

    But you kept the [unroll] pragmas. Doing so means there's still only 1 workgroup per SIMD because the register allocation is 36 (6 or 7 hardware threads).

    I calculate 14336 bytes of shared memory per workgroup, so in theory 2 workgroups should fit. If I remove the [unroll] pragmas the register allocation is 24 which means 10 hardware threads. Since 2 workgroups only use 8 hardware threads, that should fit.

    This looped version is 79 cycles and 6 clauses. So slightly "slower". So the remaining question is whether increased latency-hiding actually helps, or whether 4 hardware threads per SIMD is enough.

    Interestingly the ISA without [unroll] is unrolled anyway. The D3D assembly contains the two small loops, but the ATI compiler decides to unroll anyway. But does so differently, with the considerably lower register allocation (24 instead of 36) and the slightly longer loop.

    So, ahem, I've got no idea which is better...

    Jawed
     
  10. Jawed

    Legend

    I really don't know much about D3D TEX formats, but shouldn't it be possible to load the picture as uint formatted data? Seems kinda bizarre to, apparently, have the picture encoded as float clamped 0...1 and then convert it to int and then pack to uint for storage in LDS. Minor detail, I know...

    Isn't there a resource view in D3D that just gives 32-bits?

    (On the other hand, speaking as an image-processing purist: all this stuff should be done with linear math - the picture is presumably in gamma 2.2 space, but should be converted to floating-point linear 0...1 - in which case you'd want to use a 10-bit packed RGB format instead of 8-bit.)

    Jawed
     
  11. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,774
    Likes Received:
    156
    Location:
    Taiwan
    Yes, but I'm not sure 10 bits is good enough, though. Also, to be a purist, the dot3 operation should be weighted by Y coefficients, too.
     
  12. Jawed

    Legend

    I guess 10-12-10 would be fine.
     
  13. Jawed

    Legend

    Sigh, I missed the unroll on the first loop :oops:

    That makes it 20 registers and 83 cycles. But this brings up an interesting point: since dropping below about 31 registers can't add any extra latency-hiding (more than 8 hardware threads is pointless), it's possible to tweak the use of [unroll] to find the shortest main loop, i.e. find which of the 8 variations is best :roll:

    The trouble with this kind of optimisation is that you're at the mercy of compilers. You can't be sure that it won't just disappear with a future Catalyst or DX SDK :sad:

    I dare say even CUDA doesn't insulate the developer fully. It's not an item that comes up often in the list of GPGPU cons...

    Something like this benchmark suite at least provides a hint of progress or regression on these terms.

    Now all we need is for a driver to do a kernel replacement :shock:

    Jawed
     
  14. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    To my understanding, the HLSL compiler likes to unroll loops if it knows the exact loop count. This applies to all loops (including the large one), so I had to put a [loop] before the large one to prevent it from unrolling that.

    CUDA has a special way to limit the number of registers used by a kernel, but I don't think that's a very good solution; it's simply too "architecture dependent." However, apart from the register usage issue, there are still many architecture-dependent issues in GPGPU. For example, there is no way to get the size of a warp (or wavefront) in OpenCL (NVIDIA provides a special device query extension for warp size, but no local memory bank information). Some optimizations may depend on this data.

    Right now, fortunately, there are only two major implementations of GPGPU, so it's still possible to optimize specifically for both of them. However, as the number of devices grows, this could evolve into a serious problem.

    [EDIT] There is a newer version of GPU ShaderAnalyzer, version 1.54.
    I also did a simple "reduced symmetry" version, which should increase the ALU packing rate (although the actual amount of computation is not really reduced that much). Now it takes 39 ms to run. Unfortunately, the resulting image is a bit different from the original version's (many pixels differ by 1). I don't know if it's caused by a different computation order or maybe some bugs (most likely).

    Code:
    // NLM shader
    // Copyright(c) 2009 Ping-Che Chen 
    
    #define num_threads_x 16
    #define num_threads_y 16
    #define kernel_half 3
    
    
    Texture2D Input : register(t0);
    SamplerState PointSampler : register(s0);
    RWByteAddressBuffer Result : register(u0);
    
    
    cbuffer cb0
    {
    	float g_sigma;
    	int2 g_imagesize;
    }
    
    
    static const float gaussian[7] = {
    	0.1062888f, 0.1403214f, 0.1657702f, 0.1752401f, 0.1657702f, 0.1403214f, 0.1062888f
    };
    
    
    groupshared uint sharedc[(num_threads_y * 2) * 48];
    groupshared float dots[(num_threads_y * 2) * 48];
    groupshared float2 dots2[(num_threads_y * 2) * 16];
    
    
    float3 transform_color(uint c)
    {
    	int r = (c >> 16) & 0xff;
    	int g = (c >> 8) & 0xff;
    	int b = c & 0xff;
    	return float3(r, g, b);
    }
    
    
    [numthreads(num_threads_x, num_threads_y, 1)]
    void CSMain(uint3 Gid : SV_DispatchThreadID, uint3 Tid : SV_GroupThreadID)
    {
    	int i, j, k, l;
    
    	float3 total_color = float3(0.0f, 0.0f, 0.0f);
    	float total_weight = 0.0f;
    	float3 c;
    
    	for(i = Tid.y; i < num_threads_y * 2; i += num_threads_y) {	
    		for(j = Tid.x; j < num_threads_x * 2; j += num_threads_x) {	
    			c = Input.Load(clamp(int3(Gid.x + j - Tid.x - kernel_half * 2, Gid.y + i - Tid.y - kernel_half * 2, 0), int3(0, 0, 0), int3(g_imagesize.x, g_imagesize.y, 0)));
    			c *= 255.0f;
    			int r = c.r;
    			int g = c.g;
    			int b = c.b;
    			sharedc[i * 48 + j] = (r << 16) + (g << 8) + b;
    		}
    	}
    
    	GroupMemoryBarrierWithGroupSync();
    
    	i = -kernel_half;
    	j = -kernel_half;
    	[loop]
    	for(k = 0; k < (kernel_half * 2 + 1) * (kernel_half * 2 + 1) / 2; k++) {
    		float2 total;
    		float3 c1, c2, cd, cp;
    
    		int y1 = (i + Tid.y + kernel_half) * 48 + j + Tid.x + kernel_half;
    		int y2 = (Tid.y + kernel_half) * 48 + Tid.x + kernel_half;			
    		
    		c1 = transform_color(sharedc[y1]);
    		c2 = transform_color(sharedc[y2]);
    		cd = (c2 - c1);
    		dots[Tid.y * 48 + Tid.x] = dot(cd, cd);
    		
    	c1 = transform_color(sharedc[y1 + num_threads_x]);
    	c2 = transform_color(sharedc[y2 + num_threads_x]);
    	cd = (c2 - c1);
    	dots[Tid.y * 48 + Tid.x + num_threads_x] = dot(cd, cd);		// no matter, it's large enough
    
    	c1 = transform_color(sharedc[y1 + num_threads_y * 48]);
    	c2 = transform_color(sharedc[y2 + num_threads_y * 48]);
    	cd = (c2 - c1);
    	dots[(Tid.y + num_threads_y) * 48 + Tid.x] = dot(cd, cd);	// no matter, it's large enough
    
    	c1 = transform_color(sharedc[y1 + num_threads_y * 48 + num_threads_x]);
    	c2 = transform_color(sharedc[y2 + num_threads_y * 48 + num_threads_x]);
    	cd = (c2 - c1);
    	dots[(Tid.y + num_threads_y) * 48 + Tid.x + num_threads_x] = dot(cd, cd);
    		
    		GroupMemoryBarrierWithGroupSync();
    				
    		float2 d1 = 0.0f;
    		float2 d2 = 0.0f;
    		for(l = 0; l < kernel_half; l++) {
    			d1.x += dots[Tid.y * 48 + Tid.x + l] * gaussian[l];
    			d2.x += dots[Tid.y * 48 + Tid.x + l + kernel_half] * gaussian[l + kernel_half];
    			d1.y += dots[(Tid.y - i) * 48 + Tid.x - j + l] * gaussian[l];
    			d2.y += dots[(Tid.y - i) * 48 + Tid.x - j + l + kernel_half] * gaussian[l + kernel_half];
    		}
    		
    		dots2[Tid.y * num_threads_x + Tid.x] = d1 + d2 + float2(dots[Tid.y * 48 + Tid.x + kernel_half * 2] * gaussian[kernel_half * 2], dots[(Tid.y - i) * 48 + Tid.x - j + kernel_half * 2] * gaussian[kernel_half * 2]);
    
    		d1 = 0.0f;
    		d2 = 0.0f;
    		for(l = 0; l < kernel_half; l++) {
    			d1.x += dots[(Tid.y + num_threads_y) * 48 + Tid.x + l] * gaussian[l];
    			d2.x += dots[(Tid.y + num_threads_y) * 48 + Tid.x + l + kernel_half] * gaussian[l + kernel_half];
    			d1.y += dots[(Tid.y - i + num_threads_y) * 48 + Tid.x - j + l] * gaussian[l];
    			d2.y += dots[(Tid.y - i + num_threads_y) * 48 + Tid.x - j + l + kernel_half] * gaussian[l + kernel_half];
    		}
    			
    		dots2[(Tid.y + num_threads_y) * num_threads_x + Tid.x] = d1 + d2 + float2(dots[(Tid.y + num_threads_y) * 48 + Tid.x + kernel_half * 2] * gaussian[kernel_half * 2], dots[(Tid.y - i) * 48 + Tid.x - j + kernel_half * 2] * gaussian[kernel_half * 2]);
    		
    		GroupMemoryBarrierWithGroupSync();
    
    		d1 = 0.0f;
    		d2 = 0.0f;		
    		for(l = 0; l < kernel_half; l++) {
    			d1 += dots2[(Tid.y + l) * num_threads_x + Tid.x] * gaussian[l];
    			d2 += dots2[(Tid.y + l + kernel_half) * num_threads_x + Tid.x] * gaussian[l + kernel_half];
    		}
    		
    		d1 += dots2[(Tid.y + kernel_half * 2) * num_threads_x + Tid.x] * gaussian[kernel_half * 2];
    
    		total = exp(-(d1 + d2) * g_sigma * (1 / 255.0f) * (1 / 255.0f));
    		
    		cp = transform_color(sharedc[(i + Tid.y + kernel_half * 2) * 48 + j + Tid.x + kernel_half * 2]);
    			
    		total_weight += total.x;
    		total_color += total.x * cp;
    		
    		cp = transform_color(sharedc[(Tid.y - i + kernel_half * 2) * 48 - j + Tid.x + kernel_half * 2]);	
    		total_weight += total.y;
    		total_color += total.y * cp;
    
    		j++;
    		if(j > kernel_half) {
    			i++;
    			j = -kernel_half;
    		}
    	}
    
    	total_weight += 1.0f;
    	total_color += transform_color(sharedc[(Tid.y + kernel_half * 2) * 48 + Tid.x + kernel_half * 2]);
    
    	float3 vColor = total_color / total_weight;
    
    	if(Gid.x < g_imagesize.x && Gid.y < g_imagesize.y) {
    		float3 vc = vColor + 0.5;
    		vc = clamp(vc, 0, 255);
    		uint x = vc.x;
    		uint y = vc.y;
    		uint z = vc.z;
    		uint c = z + (y << 8) + (x << 16);
    		Result.Store((Gid.y * g_imagesize.x + Gid.x) * 4, c);
    	}
    }
    
     
  15. Jawed

    Legend

    GPUSA appears to expose DX compiler options because the D3D assembly changes with different Flow Control settings. I've got no idea how an application accesses that stuff, if at all. It may not be worth worrying about because the pragmas are more precise...

    I think the DX compiler messages you were struggling with, earlier in the thread (post 27 etc.) are caused by the compiler seeing the GroupMemoryBarrierWithGroupSync() and trying to rationalise it by unrolling. Though because it's buggy (nesting is too complex) who knows exactly...

    When you omit the [unroll] pragma from the 3 little loops it seems the default behaviour is to retain them as loops. Though this could be just because of the August 09 SDK.

    That option might be more to do with the architecture having so few registers anyway. Still too early to know how things will go with Fermi, now that spill is supposedly workable...

    I suspect hardware thread size will be added in to OpenCL.

    I dare say Fermi's enough of a split that there's now 3.

    Cool, just installed it.

    I don't understand why AMD doesn't allow this and SKA to use the installed driver (if any) as well as what it's packaged with. It contains 10.3, but 10.4 was released last week.

    39ms is quite a jump in performance, nice. The amusing thing is that the 3 little loops are unrolled, the register allocation is 19, and the main loop is 418 cycles - that's 5.7x longer :shock:

    (At least, according to GPUSA 1.54...)

    The main loop is 24 iterations instead of 49. Is that causing the difference in the result or have I missed something? (Obviously it's intentionally "half" but I'm unclear on the right way to treat the "centre".)

    I notice that there are 8 GROUP_BARRIER instructions in the ISA. This implies the main loop is unrolled by a factor of 4 by the ATI compiler (since the original loop has 2 of these instructions). The main loop is "LOOP_NO_AL i1" which appears to get its iteration count from some hidden register, so I can't confirm this directly.

    Overall it seems a lack of latency-hiding (latency caused by clause switching: ~50 cycles per switch) with the original was crushing performance. I suppose the 4-fold unroll has an effect too. Not sure what else it could be.

    Anyway, 39ms is still disappointingly slow... I guess more hardware threads are needed. That's only feasible if total shared memory falls below 10920 bytes per workgroup (i.e. allowing 3 workgroups per SIMD).

    Jawed
     
  16. Jawed

    Legend

    Ah, I've just realised that with the main loop being 10032 cycles as compared with 3577 cycles (for 53ms) along with the 4x unroll, we're effectively looking at 10032 versus 14308 cycles (or 2508 versus 3577 cycles), which is ~43% faster.

    Since it's ~36% faster in reality, I suspect that means latency-hiding is not making much of a difference - 4 hardware threads are mostly enough, not a disaster.

    Jawed
     
  17. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Could be, but it can be a bit complex. For example, NVIDIA's extension now returns a warp size of 32 for GT200, but its behavior is more like 16 (i.e. half-warp). There are still reasons to call the warp size 32, but some performance characteristics (such as branch divergence) are closer to 16 than to 32.

    Yeah. To my understanding, Fermi does not have half-warp anymore. Furthermore, its shared memory has 32 banks, just like Evergreen.


    I don't know either. The OpenCL analyzer seems to use the installed Stream SDK to do the compiling, though. Of course, GPU ShaderAnalyzer compiles much faster than the real DX11 compiler... it still takes more than one minute to compile the kernel with DX11.

    It would be nice if the shader analyzer accepted compiled binaries, though. I suspect it's possible to disassemble a compiled binary and then feed it into the analyzer.

    The center is treated after the loop. Since the center weight is always 1 (the difference is zero because the pixel is compared with itself, so exp(0) = 1), it can easily be added afterward.

    Right now this kernel does around 25% more computation than actually required. Unfortunately, it's only slightly less than 25%, so reducing it further would require reducing the number of threads (such as going from the current 16x16 to 8x8), but that would create more redundant computation.

    Another currently "redundant" computation is in the Gaussian filter, which is now performed twice for each symmetric pair. In theory this can be avoided, but that would (again) increase the amount of redundant computation, and it works out to roughly the same amount of calculation as the current version.

    After these modifications, I'm now interested to see what would happen if these modifications are done in the OpenCL version. This looks like an interesting experiment. :)
     
  18. Jawed

    Legend

    I think this is part of NVidia's effort to prevent developers from getting comfortable with the idea of 16 when the future is 32, i.e. from the point of view of G80/GT200: "the future is 32, so don't get too enamoured with 16."

    So, hopefully there won't be a similar "split" with Fermi: size for some things and size/2 for other things. Actually, hmm, thinking about it, some kind of split seems inevitable...

    With Evergreen there's that split: 64 work items but 32 banks...

    Isn't that merely to access LLVM related stuff?

    That's strange.

    Seems AMD is working towards providing compiled binaries in order to assuage developers' worries about packaging "plainly readable" kernels with their OpenCL applications, so that prolly won't happen.

    Ah, "cp" is centre pixel.

    So how does the 24-iteration loop "skip" the centre?

    I have to admit I've been wondering about 8x8. Though I'm doubtful it's faster. Trouble is, you can't tell till you try it.

    The real question is which "schedules" best on the 5-lanes. Again, you don't know till you try.

    Yeah, that's what I was thinking too. Nothing about this latest kernel seems "anti" GT200. And of course Fermi's still an unknown quantity...

    Jawed
     
  19. Andrew Lauritzen

    Moderator Veteran

    Got a bit busy and didn't have a chance to take a good look at this, but looks like you guys have made good progress! (Should we branch this off into another thread maybe?)

    If you can provide some of the surrounding code/binary I can give it a run on a GTX 480.
     
  20. Jawed

    Legend

    Currently this kernel leaves the TEX cache and fetch units idle. Since there's a huge amount of arithmetic, I think it makes sense to leave the original bitmap's pixels in the input resource and fetch them as needed. There are only 11 pixel fetches per iteration of the main loop.

    So, ALU:TEX should be high enough to obviate any concerns over being TEX bottlenecked.

    The synchs, in theory, have an effect on ALU:TEX, since they act to make the kernel behave almost like separate kernels strung together in sequence. This means ALU:TEX in each sector (bounded by synchs) of the kernel acts as a semi-independent factor, and where it's lowest it could pose a bottleneck for the kernel as a whole. Though 8 hardware threads (2 workgroups) may well be enough to make this moot, if 8 can be maintained.

    This would also reduce the per-workgroup shared memory allocation a little, though I doubt that's going to have any direct use.

    This is probably an ATI-specific tweak for two reasons:
    1. register allocation will increase - a TEX clause (if more than one instruction) needs a clump of registers
    2. ATI LDS usage costs ALU cycles - i.e. generates latency that can't be hidden
    Point 2 is an ATI-specific problem and is part of the motivation for suggesting this approach. NVidia can hide the latency of a single use of a shared memory operand. On ATI you really want to minimise the count of LDS operations (read or write) or at least balance them along with register-allocation and fetch-count (and other latency-inducing stuff). (NVidia only really cares if a shared memory operand is used multiple times: then the low bandwidth, in comparison with registers, can be an issue.)

    Since AMD has enabled image support in OpenCL, this technique should also work there. But I suspect CS5 will work better (more mature), so it's prolly the best place to start testing this.

    Jawed
     