DX11 DirectCompute Buddhabrot Demo

Discussion in 'GPGPU Technology & Programming' started by fellix, Mar 31, 2010.

  1. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,430
    Likes Received:
    309
    Location:
    Varna, Bulgaria
  2. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    Interesting. It's basically a scattered read-modify-write test.

    I assume Fermi would have a big lead in this, and it could be colossal if you render at a low enough resolution to fit in the L2.
     
  3. Broken Hope

    Regular

    Joined:
    Jul 13, 2004
    Messages:
    483
    Likes Received:
    1
    Location:
    England
    Getting around 85-90 fps on my 5870. On the website itself he says he's only getting around 45 fps on a 5870, so I'm guessing there is something wrong with his card :p
     
  4. Lightman

    Veteran Subscriber

    Joined:
    Jun 9, 2008
    Messages:
    1,778
    Likes Received:
    441
    Location:
    Torquay, UK
    Same here :smile:

    Pretty pictures, I must say! The CPU is quite slow at this ...
     
  5. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,430
    Likes Received:
    309
    Location:
    Varna, Bulgaria
    Sadly, the CPU path is single-threaded. :???:
    Even extrapolating linearly with core count, my Q9450 @ 3608MHz would reach just half the performance of the HD5870.
     
  6. yakiimo02

    Newcomer

    Joined:
    Apr 4, 2010
    Messages:
    2
    Likes Received:
    0
    Location:
    Tokyo, Japan
    Hi, thanks for DLing my Buddhabrot program. I wrote a summary of the implementation on my blog.

    http://www.yakiimo3d.com/2010/03/29/dx11-directcompute-buddhabrot-nebulabrot/

    Both the GPU and CPU implementations are unoptimized; the CPU implementation is non-SIMD and non-multithreaded. The Buddhabrot is a simple extension of the Mandelbrot fractal, but unlike the Mandelbrot, which is pure computation, the Buddhabrot uses an output increment buffer to plot the escape paths of complex numbers that are not in the Mandelbrot set, so extensive scattered memory writes (specifically InterlockedAdd atomic ops to an integer UAV buffer) are performed on global memory.

    The worst case in the sample program is 9999+999+99 writes per compute shader thread (those are the iteration maxima minus 1 for each RGB color), but I'm guessing the actual number of writes is much lower. I guess the program demonstrates that scattered memory writes to global memory are somewhat slow for GPGPU (since plain Mandelbrot programs tend to be 50x+ faster than single-threaded, non-SIMD CPU implementations). I don't really see a way to involve shared local memory for the Buddhabrot either, which is why I don't use it.
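
    To make the write pattern concrete, here is a minimal sketch of the plotting idea (not my exact shader code; the viewport-to-pixel mapping is just an assumed example, and iterationsBuffer follows the snippets quoted later in the thread):

    Code:
    RWStructuredBuffer<uint3> iterationsBuffer;   // one RGB hit counter per pixel
    
    // re-iterate an escaping point c and bump a counter for every pixel its
    // orbit visits on the way out
    void PlotOrbit( float cRe, float cIm, int nEscape, int nWidth, int nHeight )
    {
    	float zRe = cRe;
    	float zIm = cIm;
    	for( int i = 0; i < nEscape; ++i )
    	{
    		// map the orbit point to a pixel (assumed [-2,2]x[-2,2] viewport)
    		int nX = (int)( (zRe + 2.0) * 0.25 * nWidth );
    		int nY = (int)( (zIm + 2.0) * 0.25 * nHeight );
    		if( nX >= 0 && nY >= 0 && nX < nWidth && nY < nHeight ) {
    			// the scattered atomic read-modify-write discussed above
    			InterlockedAdd( iterationsBuffer[ nX + nWidth*nY ].x, 1 );
    		}
    
    		// advance z(n+1) = z(n)^2 + c
    		float zRe2 = zRe*zRe;
    		float zIm2 = zIm*zIm;
    		zIm = 2.0*zRe*zIm + cIm;
    		zRe = zRe2 - zIm2 + cRe;
    	}
    }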
     
  7. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,715
    Likes Received:
    1,929
    Location:
    Germany
    Yeah, I've noticed the same.

    Anyway: GTX 480 gets about 142-ish fps at default settings for the Buddhabrot.
     
  8. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,430
    Likes Received:
    309
    Location:
    Varna, Bulgaria
  9. Lightman

    Veteran Subscriber

    Joined:
    Jun 9, 2008
    Messages:
    1,778
    Likes Received:
    441
    Location:
    Torquay, UK
    Almost twice as fast as HD5870! Nice :smile:
     
  10. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,430
    Likes Received:
    309
    Location:
    Varna, Bulgaria
    This was done on an overclocked GTX480 -- 1675/838 MHz.
     
  11. SuperCow

    Newcomer

    Joined:
    Sep 12, 2002
    Messages:
    106
    Likes Received:
    4
    Location:
    City of cows
    I had a look at the source and did a few things to optimize performance for a 5870, going from about 80 fps to 140 fps or so.

    The first thing I did was to try to improve the performance of the Iterate() function by changing the way the dynamic loop is done in BuddhabrotCS.hlsl:

    Original code:
    Code:
    
    	for( int nIteration=0; nIteration<nMaxIteration; ++nIteration )
    	{
    		float zRe2 = zRe*zRe;
    		float zIm2 = zIm*zIm;
    
    		/// check whether absolute value of z has gotten bigger than 2
    		if( zRe2 + zIm2 > 4.0 ) {
    			return nIteration;
    		}
    
    		// calculate next z(n)
    		zIm = 2.0*zRe*zIm + cIm;
    		zRe = zRe2 - zIm2 + cRe;
    	}
    
    	// complex number did not escape to infinity 
    	return nMaxIteration;
    
    New code:
    Code:
    
        // Doing the loop this way improves performance for an equivalent result
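    // (one combined exit test instead of a mid-loop return, which is
    // presumably easier for the compiler to map to simple flow control)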
        int nIteration=0;
        float zRe2 = zRe*zRe;
        float zIm2 = zIm*zIm;
        while (nIteration<nMaxIteration && (zRe2 + zIm2 <= 4.0) )
        {
    		// calculate next z(n)
    		zIm = 2.0*zRe*zIm + cIm;
    		zRe = zRe2 - zIm2 + cRe;
    		
    		zRe2 = zRe*zRe;
    		zIm2 = zIm*zIm;
    		
    		nIteration++;
    	}
        
        return nIteration;
    
    Then I tried to simplify the code in the IterateAndPlot() function, as it was quite branching-heavy; that's usually not a good thing, since heavy branching can impact vectorization on ATI GPUs and also consumes precious GPRs.

    Original code:

    Code:
    if( nX >=0 && nY >=0 && nX < nWidth && nY < nHeight ) 
    {
    	int nIndex = nX + nWidth*nY;
    
    	if( nIteration < maxIterationsR ) {
    		InterlockedAdd( iterationsBuffer[ nIndex ].x, 1 );
    	}
    	if( nIteration < maxIterationsG ) {
    		InterlockedAdd( iterationsBuffer[ nIndex ].y, 1 );
    	}
    	if( nIteration < maxIterationsB ) {
    		InterlockedAdd( iterationsBuffer[ nIndex ].z, 1 );
    	}
    }
    New code:

    Code:
    int nScissorTestSucceed = ( nX >=0 && nY >=0 && nX < nWidth && nY < nHeight );
    			
    int nIndex = nX + nWidth*nY;
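    // a failed scissor test scales every increment to 0, so the atomic
    // adds below become no-ops without any flow control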
    			
    InterlockedAdd( iterationsBuffer[ nIndex ].x, nScissorTestSucceed * ( nIteration < maxIterationsR ) );
    InterlockedAdd( iterationsBuffer[ nIndex ].y, nScissorTestSucceed * ( nIteration < maxIterationsG ) );
    InterlockedAdd( iterationsBuffer[ nIndex ].z, nScissorTestSucceed * ( nIteration < maxIterationsB ) );
    
    Finally, I chose a thread group size that is a nice multiple of ATI Cypress' wavefront size (64). So instead of declaring [numthreads(10,10,1)] before the BuddhabrotCS() function I declared [numthreads(16,16,1)], and I modified the Dispatch() code in BuddhabrotCS.hlsl line 134 accordingly by going from:

    pd3dImmediateContext->Dispatch( (m_nInputWidth)/10, (m_nInputHeight)/10, 1 );

    to:

    pd3dImmediateContext->Dispatch( (m_nInputWidth)/16, (m_nInputHeight)/16, 1 );
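
    The declaration change itself, for clarity; the wavefront arithmetic is that 10x10 = 100 threads per group must be rounded up to two 64-wide wavefronts (leaving 28 of 128 lanes idle), while 16x16 = 256 threads is exactly four full wavefronts:

    Code:
    // before: 100 threads per group, a poor fit for 64-wide wavefronts
    [numthreads(10,10,1)]
    
    // after: 256 threads per group = four full wavefronts
    [numthreads(16,16,1)]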


    All in all these optimizations gave me a 50%+ increase in performance on my Cypress card. Note that ideal thread group sizes can vary from GPU to GPU, so the best setting may be different for your card.
    Despite these optimizations, ALU utilization is still not great on Cypress (50% or so), so it will be hard to approach the theoretical maximum with such algorithms unless they're modified to take advantage of vectorization.
     
  12. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,393
    Likes Received:
    378
    Location:
    New York
    Impressive :) Nice example of thinking outside the box.
     
  13. AlexV

    AlexV Heteroscedasticitate
    Moderator Veteran

    Joined:
    Mar 15, 2005
    Messages:
    2,528
    Likes Received:
    107
    SuperCow, al rescate! :smile:
     
  14. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,430
    Likes Received:
    309
    Location:
    Varna, Bulgaria
    I couldn't find that line entry in the HLSL file, but even leaving the thread group size unmodified I got a 50% speedup on Cypress.
     
  15. SuperCow

    Newcomer

    Joined:
    Sep 12, 2002
    Messages:
    106
    Likes Received:
    4
    Location:
    City of cows
    Sorry, that was a typo. It should have read:

    I modified the Dispatch() code in BuddhabrotCS.cpp line 134 accordingly by going from:

    pd3dImmediateContext->Dispatch( (m_nInputWidth)/10, (m_nInputHeight)/10, 1 );

    to:

    pd3dImmediateContext->Dispatch( (m_nInputWidth)/16, (m_nInputHeight)/16, 1 );
     
  16. Neb

    Neb Iron "BEAST" Man
    Legend

    Joined:
    Mar 16, 2007
    Messages:
    8,391
    Likes Received:
    3
    Location:
    NGC2264
    To the rescue?
     
  17. SuperCow

    Newcomer

    Joined:
    Sep 12, 2002
    Messages:
    106
    Likes Received:
    4
    Location:
    City of cows
    I was wondering the same thing myself :)
     
  18. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,716
    Likes Received:
    88
    Location:
    Taiwan
    I tried to vectorize it by cutting the screen in half and computing both halves in one run:

    Code:
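    	// pack two samples per thread: a pixel from the left half of the
    	// random-input texture in lane .x, and the matching right-half
    	// pixel in lane .y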
    	float2 random = randomInput.Load( uint3(DTid.x, DTid.y, 0) ).xy;
    	float2 zRe;
    	float2 zIm;
    	zRe.x = random.x;
    	zIm.x = random.y;
    	random = randomInput.Load( uint3(DTid.x + nInputWidth / 2, DTid.y, 0) ).xy;
    	zRe.y = random.x;
    	zIm.y = random.y;
    Then I vectorized the Iterate function:

    Code:
    int2 Iterate2( float2 zRe, float2 zIm ) 
    {
    	float2 cRe = zRe;
    	float2 cIm = zIm;
    
        // Doing the loop this way improves performance for an equivalent result
        int2 nIteration=int2(0, 0);
        float2 zRe2 = zRe*zRe;
        float2 zIm2 = zIm*zIm;
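    // each lane of cond is -1 while that sample is still active (below
    // the iteration cap and not yet escaped) and 0 once it is done, so
    // "nIteration += -cond" below increments only the active lanes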
    int2 cond = sign(nIteration - nMaxIteration) & int2(sign(-step(zRe2 + zIm2, float2(4.0, 4.0))));
        while (any(cond))
        {
    		// calculate next z(n)
    		zIm = float2(2.0, 2.0)*zRe*zIm + cIm;
    		zRe = zRe2 - zIm2 + cRe;
    		
    		zRe2 = zRe*zRe;
    		zIm2 = zIm*zIm;
    		
    		nIteration += -cond;
    		
    		cond = sign(nIteration - nMaxIteration) & int2(sign(-step(zRe2 + zIm2, float2(4.0, 4.0))));
    	}
        
        return nIteration;
    }
    I also tried to vectorize IterateAndPlot, but that wasn't successful, so the main function looks like:

    Code:
    [numthreads(16, 16, 1)]
    void BuddhabrotCS( uint3 Gid : SV_GroupID, uint3 DTid : SV_DispatchThreadID, uint3 GTid : SV_GroupThreadID, uint GI : SV_GroupIndex )
    {
    	float x = DTid.x;
    	float y = DTid.y;
    
    	int nInputIndex = DTid.x+DTid.y*nInputWidth;
    
    	float2 random = randomInput.Load( uint3(DTid.x, DTid.y, 0) ).xy;
    	float2 zRe;
    	float2 zIm;
    	zRe.x = random.x;
    	zIm.x = random.y;
    	random = randomInput.Load( uint3(DTid.x + nInputWidth / 2, DTid.y, 0) ).xy;
    	zRe.y = random.x;
    	zIm.y = random.y;
    		
    	// check whether sample complex number is in the mandelbrot set.
    	int2 nIterationEscape;
    	nIterationEscape  = Iterate2(zRe, zIm);
    	
    	if(nIterationEscape.x != nMaxIteration) {
    		IterateAndPlot( zRe.x, zIm.x, nOutputWidth, nOutputHeight, nIterationEscape.x );
    	}
    	
    	if(nIterationEscape.y != nMaxIteration) {
    		IterateAndPlot( zRe.y, zIm.y, nOutputWidth, nOutputHeight, nIterationEscape.y );
    	}
    }
    
    then in BuddhabrotCS.cpp:

    Code:
    	// run compute shader
    	pd3dImmediateContext->Dispatch( (m_nInputWidth)/32, (m_nInputHeight)/16, 1 );
    because only half the number of threads is required.

    By doing this, the fps goes from around 125 to around 147 on my Radeon 5850. I think that if one optimizes the Iterate function further, it could be even faster.

    Vectorizing Iterate to 4D is not faster than the 2D version, though.
     
  19. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    I'm surprised that you two were able to optimize so much, because I thought it was scatter-write limited. Four memory channels can only do 1.2 billion random read-modify-writes per second, right? Cypress' ALUs can execute 200 times as many vec5 cycles per second. AFAICS, the algorithm does two Mandelbrot iterations per InterlockedAdd.
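
    (For the 200x figure: Cypress has 1600 ALU lanes / 5 lanes per VLIW unit = 320 vec5 units, and 320 units * 850 MHz = 272 G vec5 cycles/s, roughly 227x the 1.2 billion random RMWs per second above.)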
     
  20. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,716
    Likes Received:
    88
    Location:
    Taiwan
    I think you are right, it's scatter-write limited. That's probably the reason why vectorizing the IterateAndPlot function does not work well: the vectorized version tries to do two sets of InterlockedAdds at once (although there are three InterlockedAdds in the original kernel, they write to adjacent addresses, so that's fine). By issuing two sets at once, the GPU is not able to insert more computation into the pipeline, hence more pipeline stalls.

    Vectorizing the Iterate function works because it has no InterlockedAdd, so the execution flow is something like:

    [full Mandelbrot computation] + [one step of Mandelbrot] + [InterlockedAdd] + [one step of Mandelbrot] + [InterlockedAdd] ...

    Since one step of the Mandelbrot is not enough to cover the pipeline stall caused by InterlockedAdd, optimizing the full Mandelbrot computation (which probably still involves more computation than the stall caused by InterlockedAdd, since the max iteration count is 10000) helps. Of course, this is just my speculation.

    If this is correct, then maybe a better optimization idea is to mix the IterateAndPlot function with another full Mandelbrot computation in order to cover as much of the pipeline stall as possible. For example, right now a kernel handles only one sample (or two samples in the vectorized version). It could be modified to handle tens or hundreds of samples, and while plotting the path of the first sample it would compute the full Mandelbrot of the second sample, and so on. Then hopefully the cost of the full Mandelbrot can be hidden behind the IterateAndPlot function.
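
    Roughly, the kernel structure I have in mind looks like this (untested sketch; MandelStep and the pixel mapping are made up for illustration, while iterationsBuffer and nMaxIteration are from the demo's shader):

    Code:
    // one Mandelbrot step, shared by the plotting loop and the escape test
    void MandelStep( inout float zRe, inout float zIm, float cRe, float cIm )
    {
    	float zRe2 = zRe*zRe;
    	float zIm2 = zIm*zIm;
    	zIm = 2.0*zRe*zIm + cIm;
    	zRe = zRe2 - zIm2 + cRe;
    }
    
    // plot sample A's escape path while advancing sample B's escape test by
    // one step per loop iteration, so pure ALU work sits between the atomics
    void PlotAndIterate2( float2 cA, int nEscapeA,
                          float2 cB, inout float2 zB, inout int nIterB,
                          int nWidth, int nHeight )
    {
    	float zRe = cA.x;
    	float zIm = cA.y;
    	for( int i = 0; i < nEscapeA; ++i )
    	{
    		// plot one point of A's orbit (the latency-heavy atomic)
    		int nX = (int)( (zRe + 2.0) * 0.25 * nWidth );   // assumed mapping
    		int nY = (int)( (zIm + 2.0) * 0.25 * nHeight );
    		if( nX >= 0 && nY >= 0 && nX < nWidth && nY < nHeight ) {
    			InterlockedAdd( iterationsBuffer[ nX + nWidth*nY ].x, 1 );
    		}
    		MandelStep( zRe, zIm, cA.x, cA.y );
    
    		// one step of B's escape test in between (pure ALU, no memory)
    		if( nIterB < nMaxIteration && zB.x*zB.x + zB.y*zB.y <= 4.0 )
    		{
    			MandelStep( zB.x, zB.y, cB.x, cB.y );
    			nIterB++;
    		}
    	}
    }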
     
