DX11 DirectCompute Buddhabrot Demo

Discussion in 'GPGPU Technology & Programming' started by fellix, Mar 31, 2010.

  1. rpg.314

  2. pcchen

    That would depend on the scoreboarding implementation though. I'm not familiar with Evergreen's implementation. :p

    I think we probably don't need to use a real append/consume buffer though. A private array for each thread may be good enough. Or even better, a shared array private to each work group (the append/consume process can be done with atomic operations on shared memory). What do you think?
     
  3. g__day

    That download link doesn't seem to work in either Firefox or IE. Has the link changed, guys?
     
  4. pcchen

    Do you mean my download link? You may have to click through to it; "Save Link" probably doesn't work.

    I also tried using shared memory to simulate an append/consume buffer, but it doesn't work very well. I increased the input size from 160x160 to 320x320 while keeping the number of work groups the same, so each work group does 4 times the work. A shared memory buffer of 16*16*8 entries is used as the append/consume buffer. The result is around 4100k samples/s.

    Simply increasing the input size to 320x320, without the shared memory buffer, runs at around 5700k samples/s.
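
For reference, the sizes quoted above can be checked with a little arithmetic. This is a plain Python sketch, not part of the demo. It assumes 16x16 threads per group, up to 8 samples per thread (which is what the 16*16*8 buffer size suggests), and 4 bytes for each float or int entry; all names are hypothetical.

```python
# Back-of-the-envelope check of the sizes and rates quoted in the post.
# Assumptions (not taken from the demo source): 16x16 threads per group,
# 8 samples per thread, 4 bytes per float/int entry.

GROUP_THREADS = 16 * 16
SAMPLES_PER_THREAD = 8
BYTES_PER_ENTRY = 4 + 4 + 4  # one slot each in shared_x, shared_y, shared_escape


def work_ratio(old_side, new_side):
    """How many times more samples a square input of new_side has than old_side."""
    return (new_side * new_side) // (old_side * old_side)


def buffer_entries():
    """Worst case: every sample of every thread in the group escapes."""
    return GROUP_THREADS * SAMPLES_PER_THREAD


def buffer_bytes():
    """Total groupshared bytes across the three arrays."""
    return buffer_entries() * BYTES_PER_ENTRY


def slowdown_percent(baseline_rate, new_rate):
    """Relative drop in throughput, in percent of the baseline."""
    return 100.0 * (baseline_rate - new_rate) / baseline_rate


if __name__ == "__main__":
    print(work_ratio(160, 320))    # work per group when group count is fixed
    print(buffer_entries())        # entries in each shared array
    print(buffer_bytes())          # bytes of groupshared memory
    print(round(slowdown_percent(5700, 4100), 1))
```

Under these assumptions the three arrays take 24576 bytes, which stays within the 32 KB of groupshared memory a cs_5_0 thread group may declare, and the shared-memory version is roughly 28% slower than the plain 320x320 run.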

    I suspect this is because there needs to be a group sync between the Iterate part and the IterateAndPlot part, so threads whose Iterate finishes early still have to wait for the other threads. It actually makes the situation worse: in the original version only threads in the same wavefront have to wait, but now all threads in the same work group have to wait. Maybe a real append/consume buffer could help?
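
A toy model of the waiting described above, as a rough Python sketch. It assumes a wavefront of 64 threads (as on Cypress), that lockstep execution makes a wavefront take as long as its slowest thread, and that a group-wide barrier holds every wavefront until the slowest one in the group arrives. It ignores latency hiding and scheduling entirely, and the per-thread costs are made up.

```python
WAVEFRONT = 64  # assumption: wavefront size on Cypress


def wavefront_times(costs, wave=WAVEFRONT):
    """A wavefront runs in lockstep, so it takes as long as its slowest thread."""
    return [max(costs[i:i + wave]) for i in range(0, len(costs), wave)]


def held_time_without_barrier(costs, wave=WAVEFRONT):
    """Total time wavefronts are held when each one finishes independently."""
    return sum(wavefront_times(costs, wave))


def held_time_with_group_barrier(costs, wave=WAVEFRONT):
    """With a group sync, every wavefront is held until the slowest one arrives."""
    times = wavefront_times(costs, wave)
    return max(times) * len(times)


if __name__ == "__main__":
    # Made-up Iterate costs for one 256-thread group with a single slow thread.
    costs = [10] * 256
    costs[0] = 1000
    print(held_time_without_barrier(costs))
    print(held_time_with_group_barrier(costs))
```

In this simplified picture a single slow thread delays only its own wavefront without the barrier, but delays all four wavefronts of the group with it.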

    I also thought about another idea: in theory, the time it takes to run an IterateAndPlot is proportional to its IterateEscape parameter. That means that if we have collected the IterateEscape values for all samples, it should be possible to sort the samples by that value so the time spent on IterateAndPlot by each thread is more balanced.
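
The sorting idea can be illustrated on the CPU. The following Python sketch is hypothetical and not from the demo: it assumes the cost of IterateAndPlot equals the escape count, that a wavefront of 64 threads runs in lockstep, and that consecutive buffer entries go to consecutive threads. It does not account for the cost of the sort itself.

```python
import random

WAVEFRONT = 64  # assumption: wavefront size on Cypress


def lockstep_cost(escapes, wave=WAVEFRONT):
    """Sum over wavefronts of the largest escape count in each wavefront."""
    return sum(max(escapes[i:i + wave]) for i in range(0, len(escapes), wave))


def balance_by_sorting(escapes):
    """Group samples with similar escape counts into the same wavefront."""
    return sorted(escapes)


if __name__ == "__main__":
    rng = random.Random(0)
    # Made-up escape counts standing in for the collected IterateEscape values.
    escapes = [rng.randint(1, 1000) for _ in range(2048)]
    print("arrival order:", lockstep_cost(escapes))
    print("sorted:       ", lockstep_cost(balance_by_sorting(escapes)))
```

With equal-sized wavefronts, sorting never increases this cost, because each wavefront then only pays for samples of similar length. Whether the gain outweighs sorting on the GPU would have to be measured.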

    The code for the shared memory version is here (shared_idx, shared_x, shared_y, and shared_escape are groupshared memory):
    Code:
    	// One thread resets the group's append counter before anyone uses it.
    	if(GI == 0) {
    		shared_idx = 0;
    	}
    	
    	GroupMemoryBarrierWithGroupSync();
    
    	int y = DTid.y;
    	int idx;
    	for(int i = 0; i < 2; i++) {
    		int x = DTid.x;
    		for(int j = 0; j < 2; j++) {
    			float2 random = randomInput.Load( uint3(x, y, 0) ).xy;
    			float2 zRe;
    			float2 zIm;
    			zRe.x = random.x;
    			zIm.x = random.y;
    			random = randomInput.Load( uint3(x + nInputWidth / 2, y, 0) ).xy;
    			zRe.y = random.x;
    			zIm.y = random.y;
    		
    			// Check whether each of the two sample points is in the Mandelbrot set.
    			int2 nIterationEscape;
    			nIterationEscape  = Iterate2_unroll(zRe, zIm);
    			
    			if(nIterationEscape.x != nMaxIteration) {
    				InterlockedAdd(shared_idx, 1, idx);
    				shared_x[idx] = zRe.x;
    				shared_y[idx] = zIm.x;				
    				shared_escape[idx] = nIterationEscape.x;
    			}
    			
    			if(nIterationEscape.y != nMaxIteration) {
    				InterlockedAdd(shared_idx, 1, idx);
    				shared_x[idx] = zRe.y;
    				shared_y[idx] = zIm.y;
    				shared_escape[idx] = nIterationEscape.y;		
    			}
    			
    			x += nInputWidth / 4;
    		}
    		
    		y += nInputHeight / 2;
    	}
    	
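    	// Wait until every thread in the group has finished appending.
    	GroupMemoryBarrierWithGroupSync();
    	
    	// Consume: each thread walks the buffer with a stride of the group size.
    	idx = GI;
    	while(idx < shared_idx) {
    		IterateAndPlot( shared_x[idx], shared_y[idx], nOutputWidth, nOutputHeight, shared_escape[idx]);
    		idx += 16 * 16;
    	}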
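
The append and consume phases of the shader above can be mimicked sequentially on the CPU to check the indexing. This Python sketch is only a model: a list append stands in for the InterlockedAdd on shared_idx, the sample data is made up, and the function names are hypothetical.

```python
GROUP_SIZE = 16 * 16  # threads per work group, as in the shader


def append_phase(samples, max_iteration):
    """Keep only escaped samples; list.append stands in for InterlockedAdd + store."""
    buffer = []
    for x, y, escape in samples:
        if escape != max_iteration:
            buffer.append((x, y, escape))
    return buffer


def consume_phase(buffer, group_size=GROUP_SIZE):
    """Replay the strided while loop for each thread index GI.

    Returns how often each buffer entry is visited and how many
    entries each thread processes.
    """
    visits = [0] * len(buffer)
    per_thread = [0] * group_size
    for gi in range(group_size):
        idx = gi
        while idx < len(buffer):
            visits[idx] += 1
            per_thread[gi] += 1
            idx += group_size
    return visits, per_thread


if __name__ == "__main__":
    max_iteration = 1000
    # Made-up samples: every third one never escapes.
    samples = [(0.0, 0.0, max_iteration if i % 3 == 0 else 10) for i in range(2048)]
    buffer = append_phase(samples, max_iteration)
    visits, per_thread = consume_phase(buffer)
    print(len(buffer), set(visits), min(per_thread), max(per_thread))
```

In this model every compacted entry is plotted exactly once, and the number of entries per thread differs by at most one, although the cost of each entry can still vary with its escape count.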
     
  5. XMAN26

    Could someone with a GTX480 and the coding know-how optimize it for the GF100? It seems everything done so far has been to optimize it for Cypress.
     
  6. Psycho

    P(GTX480) * P(know how) = very small number ;)

    Most of the optimizations (except the vectorizing) are also valid for Fermi; Evergreen just benefits more from them (or, you could say, needs them more). So I think it's mostly a matter of tweaking different sizes (and you more or less need a GF100 at hand for that).
    But sure it would be interesting to see, to get the full picture.
    This thread is already in line with our expectations that naive/generic code will usually run faster on GF100, while the tables are turned by a decent level of optimization, bringing Cypress closer to its peak rates.
     
  7. XMAN26

    Very true, I'd just like to see some scores from optimized code for both.
     
  8. g__day

    pcchen - no, fellix's original link returns a 404 error on the page with the download link. Is the latest code hosted anywhere else? Link please!
     
  9. pcchen

  10. Mendel

    I'm getting a 404 Not Found when trying to download. Any mirrors?
     
  11. pcchen

  12. Man from Atlantis
