View Full Version : DX11 DirectCompute Buddhabrot Demo
Buddhabrot generator for DC (http://www.geeks3d.com/20100330/dx11-directcompute-buddhabrot-demo/)
Mintmaster
02-Apr-2010, 16:04
Interesting. It's basically a scattered read-modify-write test.
I assume Fermi would have a big lead in this, and it could be colossal if you render at a low enough resolution to fit in the L2.
Broken Hope
02-Apr-2010, 18:48
Getting around 85-90 fps on my 5870, on the website itself he says he's only getting around 45 fps on a 5870 I'm guessing there is something wrong with his card :P
Lightman
03-Apr-2010, 11:44
Getting around 85-90 fps on my 5870, on the website itself he says he's only getting around 45 fps on a 5870 I'm guessing there is something wrong with his card :P
Same here :smile:
Pretty pictures I must say! CPU is quite slow in this ...
Sadly, the CPU path is single-threaded. :???:
Extrapolating linearly with core count, my Q9450 @ 3608MHz would be just 1/2 the performance of HD5870.
yakiimo02
04-Apr-2010, 15:16
Hi, thanks for DLing my Buddhabrot program. I wrote a summary of the implementation on my blog.
http://www.yakiimo3d.com/2010/03/29/dx11-directcompute-buddhabrot-nebulabrot/
Both GPU and CPU implementation are non-optimized. CPU implementation is non-SIMD, non-multithreaded. Buddhabrot is a simple extension of the Mandelbrot fractal, but unlike the Mandelbrot which is pure computation, the Buddhabrot uses an output increment buffer in order to plot the path of the escape of a complex number not in the Mandelbrot set and extensive scattered memory writes (specifically IterlockedAdd atomic ops to an integer UAV buffer) are performed on global memory.
Worst case in sample program is 9999+999+99 writes (those are iteration maxes-1 for each rgb color) per compute shader thread, but I'm guessing actual write is much lower. I guess the program demonstrates that scattered memory writes to global memory is somewhat slow for GPGPU (since Mandelbrot program are like 50x+ times faster than single-threaded, non-SIMD CPU programs.) I don't really see way to involve shared local memory for the Buddhabrot either and that's why I don't use it.
CarstenS
06-Apr-2010, 12:39
Getting around 85-90 fps on my 5870, on the website itself he says he's only getting around 45 fps on a 5870 I'm guessing there is something wrong with his card :P
Yeah, I've noticed the same.
Anyway: GTX 480 gets about 142is Fps in default settings for Buddabrot.
http://i162.photobucket.com/albums/t241/bobytt/Buddhabrot_DirectCompute2010-04-100.png
GTX480
Lightman
10-Apr-2010, 22:12
http://i162.photobucket.com/albums/t241/bobytt/Buddhabrot_DirectCompute2010-04-100.png
GTX480
Almost twice as fast as HD5870! Nice :smile:
This was done on an overclocked GTX480 -- 1675/838 MHz.
SuperCow
12-Apr-2010, 14:04
I had a look at the source and did a few things to optimize performance for a 5870, going from about 80 fps to 140 fps or so.
The first thing I did was to try to improve the performance of the Iterate() function by changing the way the dynamic loop is done in BuddhabrotCS.hlsl:
Original code:
for( int nIteration=0; nIteration<nMaxIteration; ++nIteration )
{
float zRe2 = zRe*zRe;
float zIm2 = zIm*zIm;
/// check whether absolute value of z has gotten bigger than 2
if( zRe2 + zIm2 > 4.0 ) {
return nIteration;
}
// calculate next z(n)
zIm = 2.0*zRe*zIm + cIm;
zRe = zRe2 - zIm2 + cRe;
}
// complex number did not escape to infinity
return nMaxIteration;
New code:
// Doing the loop this way improves performance for an equivalent result
int nIteration=0;
float zRe2 = zRe*zRe;
float zIm2 = zIm*zIm;
while (nIteration<nMaxIteration && (zRe2 + zIm2 <= 4.0) )
{
// calculate next z(n)
zIm = 2.0*zRe*zIm + cIm;
zRe = zRe2 - zIm2 + cRe;
zRe2 = zRe*zRe;
zIm2 = zIm*zIm;
nIteration++;
}
return nIteration;
Then I tried to simplify the code in the IterateAndPlot() function as it was quite branching-heavy, which is usually not a good thing as it can impact vectorization on ATI GPUs, and also it can consumes precious GPRs.
Original code:
if( nX >=0 && nY >=0 && nX < nWidth && nY < nHeight )
{
int nIndex = nX + nWidth*nY;
if( nIteration < maxIterationsR ) {
InterlockedAdd( iterationsBuffer[ nIndex ].x, 1 );
}
if( nIteration < maxIterationsG ) {
InterlockedAdd( iterationsBuffer[ nIndex ].y, 1 );
}
if( nIteration < maxIterationsB ) {
InterlockedAdd( iterationsBuffer[ nIndex ].z, 1 );
}
}
New code:
int nScissorTestSucceed = ( nX >=0 && nY >=0 && nX < nWidth && nY < nHeight );
int nIndex = nX + nWidth*nY;
InterlockedAdd( iterationsBuffer[ nIndex ].x, nScissorTestSucceed * ( nIteration < maxIterationsR ) );
InterlockedAdd( iterationsBuffer[ nIndex ].y, nScissorTestSucceed * ( nIteration < maxIterationsG ) );
InterlockedAdd( iterationsBuffer[ nIndex ].z, nScissorTestSucceed * ( nIteration < maxIterationsB ) );
Finally I chose a thread group size that is a nice multiple of the ATI's Cypress wavefront size (64). So instead of declaring [numthreads(10,10,1)] before the BuddhabrotCS() function I declared [numthreads(16,16,1)], and I modified the Dispatch() code in BuddhabrotCS.hlsl line 134 accordingly by going from:
pd3dImmediateContext->Dispatch( (m_nInputWidth)/10, (m_nInputHeight)/10, 1 );
to:
pd3dImmediateContext->Dispatch( (m_nInputWidth)/16, (m_nInputHeight)/16, 1 );
All in all these optimizations gave me a 50%+ increase in performance on my Cypress card. Note that ideal values for thread group can vary from GPU to GPU so the ideal setting may be different for your card.
Despite these optimizations the ALU utilization is still not great on Cypress cards (50% or so) so it will be hard to achieve a theoretical maximum with such algorithms, unless they're modified to take advantage of vectorization.
trinibwoy
12-Apr-2010, 14:58
Then I tried to simplify the code in the IterateAndPlot() function as it was quite branching-heavy, which is usually not a good thing as it can impact vectorization on ATI GPUs, and also it can consumes precious GPRs.
Impressive :) Nice example of thinking outside the box.
SuperCow, arrescate!:smile:
I modified the Dispatch() code in BuddhabrotCS.hlsl line 134 accordingly by going from:
pd3dImmediateContext->Dispatch( (m_nInputWidth)/10, (m_nInputHeight)/10, 1 );
to:
pd3dImmediateContext->Dispatch( (m_nInputWidth)/16, (m_nInputHeight)/16, 1 );
I couldn't find this line entry in the HLSL file, but even leaving the thread group size unmodified I got 50% speedup for Cypress.
SuperCow
13-Apr-2010, 09:54
Sorry, was a typo. This should have read:
I modified the Dispatch() code in BuddhabrotCS.cpp line 134 accordingly by going from:
pd3dImmediateContext->Dispatch( (m_nInputWidth)/10, (m_nInputHeight)/10, 1 );
to:
pd3dImmediateContext->Dispatch( (m_nInputWidth)/16, (m_nInputHeight)/16, 1 );
SuperCow, arrescate!:smile:
To the rescue?
SuperCow
13-Apr-2010, 16:23
To the rescue?
I was wondering the same thing myself :)
I tried to vectorize it by cutting the screen by half and compute both half in one run:
float2 random = randomInput.Load( uint3(DTid.x, DTid.y, 0) ).xy;
float2 zRe;
float2 zIm;
zRe.x = random.x;
zIm.x = random.y;
random = randomInput.Load( uint3(DTid.x + nInputWidth / 2, DTid.y, 0) ).xy;
zRe.y = random.x;
zIm.y = random.y;
Then I vectorize the function Iterate:
int2 Iterate2( float2 zRe, float2 zIm )
{
float2 cRe = zRe;
float2 cIm = zIm;
// Doing the loop this way improves performance for an equivalent result
int2 nIteration=int2(0, 0);
float2 zRe2 = zRe*zRe;
float2 zIm2 = zIm*zIm;
int2 cond = (sign(nIteration - nMaxIteration) & sign(-step(zRe2 + zIm2, float2(4.0, 4.0))));
while (any(cond))
{
// calculate next z(n)
zIm = float2(2.0, 2.0)*zRe*zIm + cIm;
zRe = zRe2 - zIm2 + cRe;
zRe2 = zRe*zRe;
zIm2 = zIm*zIm;
nIteration += -cond;
cond = (sign(nIteration - nMaxIteration) & sign(-step(zRe2 + zIm2, float2(4.0, 4.0))));
}
return nIteration;
}
I also tried to vectorize IterateAndPlot but it's not successful, so the main function looks like:
[numthreads(16, 16, 1)]
void BuddhabrotCS( uint3 Gid : SV_GroupID, uint3 DTid : SV_DispatchThreadID, uint3 GTid : SV_GroupThreadID, uint GI : SV_GroupIndex )
{
float x = DTid.x;
float y = DTid.y;
int nInputIndex = DTid.x+DTid.y*nInputWidth;
float2 random = randomInput.Load( uint3(DTid.x, DTid.y, 0) ).xy;
float2 zRe;
float2 zIm;
zRe.x = random.x;
zIm.x = random.y;
random = randomInput.Load( uint3(DTid.x + nInputWidth / 2, DTid.y, 0) ).xy;
zRe.y = random.x;
zIm.y = random.y;
// check whether sample complex number is in the mandelbrot set.
int2 nIterationEscape;
nIterationEscape = Iterate2(zRe, zIm);
if(nIterationEscape.x != nMaxIteration) {
IterateAndPlot( zRe.x, zIm.x, nOutputWidth, nOutputHeight, nIterationEscape.x );
}
if(nIterationEscape.y != nMaxIteration) {
IterateAndPlot( zRe.y, zIm.y, nOutputWidth, nOutputHeight, nIterationEscape.y );
}
}
then in BuddhabrotCS.cpp:
// run compute shader
pd3dImmediateContext->Dispatch( (m_nInputWidth)/32, (m_nInputHeight)/16, 1 );
because only half amount of kernels is required.
By doing this, the fps goes from around 125fps to around 147 fps on my Radeon 5850. I think that if one optimizes Iterate function further it could be even faster.
Vectorize Iterate to 4D is not faster than the original version, though.
Mintmaster
14-Apr-2010, 14:12
I'm surprised that you two were able to optimize so much because I thought it was scatter write limited. Four memory channels can only do 1.2 billion random read-modify-writes per second, right? Cypress' ALUs can execute 200 times as many vec5 cycles per second. AFAICS, the algorithm does two Mandelbrot iterations per InterlockedAdd.
I'm surprised that you two were able to optimize so much because I thought it was scatter write limited. Four memory channels can only do 1.2 billion random read-modify-writes per second, right? Cypress' ALUs can execute 200 times as many vec5 cycles per second. AFAICS, the algorithm does two Mandelbrot iterations per InterlockedAdd.
I think you are right, it's scatter write limited. That's probably the reason why vectorize the "IterateAndPlot" function does not work well, because the vectorized version tries to do two sets of InterlockedAdd at once (although there are three InterlockedAdd in the original kernel, they write to adjacent addresses so it's fine). By writing two sets at once, the GPU is not able to insert more computations in the pipeline, therefore more pipeline stalls.
Vectorizing Iterate function works because it has no InterlockedAdd, so the execution flow is something like:
[full Mandelbrot computation] + [one step of Mandelbrot] + [InterlockedAdd] + [one step of Mandelbrot] + [InterlockedAdd] ...
Since one step of Mandelbrot is not enough to cover the pipeline stall cased by InterlockedAdd, optimizing the full Mandelbrot computation (which has probably still more computation than the pipeline stall cased by InterlockedAdd because the max iteration is 10000) works. Of course, this is just my speculation.
If this is correct, then maybe a better optimization idea is try to mix the IterateAndPlot function with another full Mandelbrot computation in order to cover as much as pipeline stall as possible. For example, right now a kernel runs only one sample (or two samples in the vectorized version). It should be modified to handle tens or hundreds of samples, and when plotting the path of the first sample, it would compute the full Mandelbrot of the second sample, and so on. Then hopefully the cost of the full Mandelbrot can be hidden by the IterateAndPlot function.
When I was doing my laundry, I thought about another thing: for many samples they don't plot their path, because they converge. Therefore, the function "Iterate" may be more important than previously believed. So I decided to optimize it using Voxilla's unroll idea. Since it's easier to do so with the scalar version, I unrolled it by 16, and replay the loop if it's "overflow," basically:
int Iterate_unroll( float zRe, float zIm )
{
float cRe = zRe;
float cIm = zIm;
// Doing the loop this way improves performance for an equivalent result
int nIteration=0;
float zRe2 = zRe*zRe;
float zIm2 = zIm*zIm;
float oldzRe;
float oldzIm;
while (nIteration<nMaxIteration && (zRe2 + zIm2 <= 4.0) )
{
oldzRe = zRe;
oldzIm = zIm;
[unroll]
for(int i = 0; i < 16; i++) {
zIm = 2.0*zRe*zIm + cIm;
zRe = zRe2 - zIm2 + cRe;
zRe2 = zRe*zRe;
zIm2 = zIm*zIm;
}
nIteration += (zRe2 + zIm2 <= 4.0) ? 16 : 0;
}
// replay the loop if it's overflowed
if(nIteration < nMaxIteration) {
// replay
zRe = oldzRe;
zIm = oldzIm;
zRe2 = zRe * zRe;
zIm2 = zIm * zIm;
while(nIteration < nMaxIteration && (zRe2 + zIm2 <= 4.0)) {
zIm = 2.0*zRe*zIm + cIm;
zRe = zRe2 - zIm2 + cRe;
zRe2 = zRe*zRe;
zIm2 = zIm*zIm;
nIteration++;
}
}
return nIteration >= nMaxIteration ? nMaxIteration : nIteration;
}
By using this version, it's now running at near 200 fps on my Radeon 5850. I think it can be even faster if the vectorized version is unrolled.
[EDIT] I got it to around 225 fps after unrolling the 2D version:
int2 Iterate2_unroll( float2 zRe, float2 zIm )
{
float2 cRe = zRe;
float2 cIm = zIm;
// Doing the loop this way improves performance for an equivalent result
int2 nIteration=int2(0, 0);
float2 zRe2 = zRe*zRe;
float2 zIm2 = zIm*zIm;
float2 oldzRe, oldzIm;
int2 cond = (sign(nIteration - int2(nMaxIteration, nMaxIteration)) & sign(-step(zRe2 + zIm2, float2(4.0, 4.0))));
while (any(cond))
{
oldzRe = zRe;
oldzIm = zIm;
[unroll]
for(int i = 0; i < 16; i++) {
zIm = float2(2.0, 2.0)*zRe*zIm + cIm;
zRe = zRe2 - zIm2 + cRe;
zRe2 = zRe*zRe;
zIm2 = zIm*zIm;
}
nIteration += (sign(step(zRe2 + zIm2, float2(4.0, 4.0))) * 16);
nIteration = clamp(nIteration, 0, int2(nMaxIteration, nMaxIteration));
cond = (sign(nIteration - int2(nMaxIteration, nMaxIteration)) & sign(-step(zRe2 + zIm2, float2(4.0, 4.0))));
}
nIteration = clamp(nIteration, 0, int2(nMaxIteration, nMaxIteration));
cond = sign(nIteration - int2(nMaxIteration, nMaxIteration));
if(any(cond)) {
zRe = oldzRe;
zIm = oldzIm;
zRe2 = zRe * zRe;
zIm2 = zIm * zIm;
int2 cond = (sign(nIteration - int2(nMaxIteration, nMaxIteration)) & sign(-step(zRe2 + zIm2, float2(4.0, 4.0))));
while(any(cond)) {
// calculate next z(n)
zIm = float2(2.0, 2.0)*zRe*zIm + cIm;
zRe = zRe2 - zIm2 + cRe;
zRe2 = zRe*zRe;
zIm2 = zIm*zIm;
nIteration += -cond;
cond = (sign(nIteration - int2(nMaxIteration, nMaxIteration)) & sign(-step(zRe2 + zIm2, float2(4.0, 4.0))));
}
}
return clamp(nIteration, 0, int2(nMaxIteration, nMaxIteration));
}
Four memory channels can only do 1.2 billion random read-modify-writes per second, right?
That's eight channels for Cypress. ;)
I think I made a mistake. The original program has a predefined number of samples per frame, which is 100x100. This is not very good fit for 16x16 work group size. Therefore, I changed it to 160x160. This way, comparing using fps is no longer appropriate. Sample rate should be used instead.
I redid these tests and the results are:
Original (100x100, w/ SuperCow's optimizations): 1100k/s
160x160: 1700k/s
160x160 (2D version): 2260k/s
160x160 (1D unrolled): 2700k/s
160x160 (2D unrolled): 3500k/s
Changing the number of samples by too much will make results incomparable, because the tone mapping and other post processing are only done per frame. For example, if I changed the number of samples to 320x320, it becomes 5700k/s, because the fps goes from around 140 fps to around 55 fps, thus less tone mapping and post processing need to be done.
[EDIT] I posted a version of my modified program for those who are interested in testing it :)
download (http://sites.google.com/a/kimicat.com/hotballshive/dang-an-jia/2010_03_29_Buddhabrot_DirectCompute_mod.rar)
Albuquerque
15-Apr-2010, 07:13
I'm not sure what the default settings are for pcchen's code there, but I'm getting right at 3350k/sec sample rate on my 5850 overclocked to 950 core + 1150 mem. This is just launching the window with the default windowed size and default iteration rate.
I notice that if I maximize the window, the view center shifts upwards (ie, the rendering is all in the lower 2/3rds of the screen.)
The default is 2D unrolled. My test numbers are from the default windows size (640x480). Maximize the window may slow it down because the cost of post processing increases.
The problem you observed (shift of view center) seems to be related to the aspect ratio. It probably happens only on wide screens (I have this problem too).
GTX 480 - stock clocks, 197.44 drivers, Win7 x64
Original : 1308k/s
Original (160x160): 1424k/s
Original (160x160, w/ SuperCow's optimizations): 1024k/s
Original (160x160, w/ SuperCow's optimizations /wo SuperCow's IterateAndPlot): 1646k/s
160x160 (1D unrolled - /wo SuperCow's IterateAndPlot): 2055k/s
160x160 (2D unrolled - /wo SuperCow's IterateAndPlot): 1935/s
Lightman
16-Apr-2010, 23:57
Excellent thinking pcchen
So Evergreen can be quicker than GF100 in this little program! Amazing how big impact on performance right optimization can have :shock:
caveman-jim
17-Apr-2010, 03:38
Excellent thinking pcchen
So Evergreen can be quicker than GF100 in this little program! Amazing how big impact on performance right optimization can have :shock:
Architecture optimized code is faster on the target architecture than generic code on non-targeted architecture!?! say it ain't so!
Seriously - very cool guys. Love the contributions to the code and the discussion. Makes me wanna be a programmer again :)
yakiimo02
17-Apr-2010, 08:24
HD5750 Catalyst 10.2 Vista x64
original 420k/s
default pcchen's 2500k/s
Getting big performance increase. Thanks for the optimizations everyone. Will learn a lot from the optimized codes.
Flam4's GPGPU implementor Keldor suggested I use combined Tausworthe random number generator (GPU Gems3 article exists) to do parallel prng generation on the GPU, so that I can sample more than once per frame (originally tried parallel MWC, saw non-random patterns, didn't see optimization and was too lazy to pursue). Being able to generate PRNG in CS is big change and if I set target per thread iteration count, I can ameliorate threads finishing early and idly waiting for the thread with largest exit iteration to finish by sampling more until the thread hits iteration target.
Not really knowledgeable about GPU hardware, but assuming this will be a big win even with per thread increased scattered writes.
Flam4's GPGPU implementor Keldor suggested I use combined Tausworthe random number generator (GPU Gems3 article exists) to do parallel prng generation on the GPU, so that I can sample more than once per frame (originally tried parallel MWC, saw non-random patterns, didn't see optimization and was too lazy to pursue). Being able to generate PRNG in CS is big change and if I set target per thread iteration count, I can ameliorate threads finishing early and idly waiting for the thread with largest exit iteration to finish by sampling more until the thread hits iteration target.
Not really knowledgeable about GPU hardware, but assuming this will be a big win even with per thread increased scattered writes.
I think this is a great idea. To my understanding, in a wavefront, finished threads will have to wait for unfinished threads. Since converge samples don't do scattered writes, these threads will be idling for those which do. So this may help a lot.
I also tried to interleave some computation between scattered writes, but so far I am unable to make it run any faster. But I still think this may help somehow, especially with sampling more than once per frame.
keldor314
20-Apr-2010, 18:14
I think this is a great idea. To my understanding, in a wavefront, finished threads will have to wait for unfinished threads. Since converge samples don't do scattered writes, these threads will be idling for those which do. So this may help a lot.
I also tried to interleave some computation between scattered writes, but so far I am unable to make it run any faster. But I still think this may help somehow, especially with sampling more than once per frame.
Given the amount of hyperthreading used on current GPUs, I'd be surprised to see reordering the code have very much effect - it's already in effect doing this by swapping between perhaps 10 sets of threads per processor every instruction cycle.
That iterate loop looks like trouble, since it'll pretty much always iterate all the way to maxIterations due to the nature of the SIMD processors. In any SIMD group, you're almost certain to have at least one point that started inside the M set!
What we really need is some way to unify the code path for all cases. Maybe a two stage approach using an append/consume buffer to first accumulate escaping points and then rendering them? Something like this:
float cx= randFloat()*width+leftBound;
float cy= randFloat()*height+bottomBound;
float x = cx; //The mandelbrot equation makes it so the first iteration goes to (cx,cy)
float y = cy;
int n = 0;
while (n < maxIters)
{
for (int m=0;m<16;m++)
{
//Iterate
}
if (x*x+y*y>4.0f)
Append(x,y);
n++;
}
bool needAnotherPoint = true;
while (true) //repeat until append/consume buffer is empty
{
if (needAnotherPoint)
{
if (!Consume(&x,&y))
break;
cx=x;
cy=y;
needAnotherPoint=false;
}
for (int m=0;m<16,m++)
{
//Plot and Iterate (x,y)
}
if (x*x+y*y>4.0f)
needAnotherPoint=true;
}
keldor314
20-Apr-2010, 18:16
Bloody thing doesn't have an edit!
It should grab new values for x,y,cx,cy after each Append() in the code.
keldor314
20-Apr-2010, 18:17
Also, it should Append(cx,cy). Why isn't there an edit function?!
willardjuice
20-Apr-2010, 18:21
Why isn't there an edit function?!
You can edit after x amount of posts (where x is a number that I can't remember atm :razz:).
keldor314
20-Apr-2010, 18:22
float cx= randFloat()*width+leftBound;
float cy= randFloat()*height+bottomBound;
float x = cx; //The mandelbrot equation makes it so the first iteration goes to (cx,cy)
float y = cy;
int n = 0;
while (n < maxIters)
{
for (int m=0;m<16;m++)
{
//Iterate
}
if (x*x+y*y>4.0f)
{
Append(cx,cy); //The orbit starting at (cx,cy) escapes!
float cx= randFloat()*width+leftBound;
float cy= randFloat()*height+bottomBound;
float x = cx;
float y = cy;
}
n++;
}
bool needAnotherPoint = true;
while (true) //repeat until append/consume buffer is empty
{
if (needAnotherPoint)
{
if (!Consume(&x,&y))
break;
cx=x;
cy=y;
needAnotherPoint=false;
}
for (int m=0;m<16,m++)
{
//Plot and Iterate (x,y)
}
if (x*x+y*y>4.0f)
needAnotherPoint=true;
}
keldor314
20-Apr-2010, 18:23
Grr.....
float cx= randFloat()*width+leftBound;
float cy= randFloat()*height+bottomBound;
float x = cx; //The mandelbrot equation makes it so the first iteration goes to (cx,cy)
float y = cy;
int n = 0;
while (n < maxIters)
{
for (int m=0;m<16;m++)
{
//Iterate
}
if (x*x+y*y>4.0f)
{
Append(cx,cy); //The orbit starting at (cx,cy) escapes!
cx= randFloat()*width+leftBound;
cy= randFloat()*height+bottomBound;
x = cx;
y = cy;
}
n++;
}
bool needAnotherPoint = true;
while (true) //repeat until append/consume buffer is empty
{
if (needAnotherPoint)
{
if (!Consume(&x,&y))
break;
cx=x;
cy=y;
needAnotherPoint=false;
}
for (int m=0;m<16,m++)
{
//Plot and Iterate (x,y)
}
if (x*x+y*y>4.0f)
needAnotherPoint=true;
}
keldor314
20-Apr-2010, 18:24
It also should probably by n += 16 rather than n++, but meh :-P
keldor314
20-Apr-2010, 18:28
It's also worth noting that that 16 is just a guess - the best value will need some trial and error to find, and depends on the relative balance of the append/consume overhead to the iteration cost. The optimal value inside the Append loop is also likely different from that inside the Consume loop, since the proportions will be different. Note that we don't need to check for bounds every iteration since the exact number of iterations until escape is unimportant. Assuming of course that the iteration number for each color channel is divisible by 16, or whatever the inner loop count is!
keldor314
20-Apr-2010, 18:29
You can edit after x amount of posts (where x is a number that I can't remember atm :razz:).
So is there a thread where newbies can spam posts so that they get their edit function unlocked? >.<
Lightman
20-Apr-2010, 20:55
So is there a thread where newbies can spam posts so that they get their edit function unlocked? >.<
You're doing well in this thread :lol:
Seriously though you need 10 posts and 10 days on forum to get EDIT button.
rpg.314
20-Apr-2010, 21:10
So is there a thread where newbies can spam posts so that they get their edit function unlocked? >.<
http://forum.beyond3d.com/showthread.php?t=50668 :lol:
Given the amount of hyperthreading used on current GPUs, I'd be surprised to see reordering the code have very much effect - it's already in effect doing this by swapping between perhaps 10 sets of threads per processor every instruction cycle.
That would depend on the scoreboarding implementation though. I'm not familiar with Evergreen's implemention. :p
That iterate loop looks like trouble, since it'll pretty much always iterate all the way to maxIterations due to the nature of the SIMD processors. In any SIMD group, you're almost certain to have at least one point that started inside the M set!
What we really need is some way to unify the code path for all cases. Maybe a two stage approach using an append/consume buffer to first accumulate escaping points and then rendering them?
I think we probably don't need to use a real append/consume buffer though. A private array for each thread may be good enough. Or even better, a private shared array for each work group (the append/consume process can be done with atomic operations on shared memory). How do you think?
That download link doesn't seem to work in either firefox or IE - has the link changed guys?
That download link doesn't seem to work in either firefox or IE - has the link changed guys?
Do you mean my download link? You may have to click into it, "Save Link" probably doesn't work.
Also I tried using shared memory to simulate an append/consume buffer, but it doesn't work very well. I increased the input size from 160x160 to 320x320, but the number of work groups remains the same, so each work group will do 4 times works. A shared memory buffer the size of 16*16*8 is used for append/consume buffer. The result is around 4100k samples/s.
Simply increase the input size to 320x320 runs at around 5700 k samples/s.
I suspect that because there needs to be a groupsync between the Iterate part and the IterateAndPlot part, threads with faster Iterates still have to wait for other threads. It actually makes the situation worse because in the original version only threads in the same wavefront have to wait, but now all threads in the same work group have to wait. Maybe a real append/consume buffer could help?
I also thought about another idea: in theory, the time it takes to run an IterateAndPlot is proportional to the IterateEscape parameter. That means, if we have collected all IterateEscape for each samples, it should be possible to sort them so the time spent on IterateAndPlot for each thread is more balanced.
The code is here (shared_idx, shared_x, shared_y, and shared_escape are groupshared memory):
if(GI == 0) {
shared_idx = 0;
}
GroupMemoryBarrierWithGroupSync();
int y = DTid.y;
int idx;
for(int i = 0; i < 2; i++) {
int x = DTid.x;
for(int j = 0; j < 2; j++) {
float2 random = randomInput.Load( uint3(x, y, 0) ).xy;
float2 zRe;
float2 zIm;
zRe.x = random.x;
zIm.x = random.y;
random = randomInput.Load( uint3(x + nInputWidth / 2, y, 0) ).xy;
zRe.y = random.x;
zIm.y = random.y;
// check whether sample complex number is in the mandelbrot set.
int2 nIterationEscape;
nIterationEscape = Iterate2_unroll(zRe, zIm);
if(nIterationEscape.x != nMaxIteration) {
InterlockedAdd(shared_idx, 1, idx);
shared_x[idx] = zRe.x;
shared_y[idx] = zIm.x;
shared_escape[idx] = nIterationEscape.x;
}
if(nIterationEscape.y != nMaxIteration) {
InterlockedAdd(shared_idx, 1, idx);
shared_x[idx] = zRe.y;
shared_y[idx] = zIm.y;
shared_escape[idx] = nIterationEscape.y;
}
x += nInputWidth / 4;
}
y += nInputHeight / 2;
}
GroupMemoryBarrierWithGroupSync();
idx = GI;
while(idx < shared_idx) {
IterateAndPlot( shared_x[idx], shared_y[idx], nOutputWidth, nOutputHeight, shared_escape[idx]);
idx += 16 * 16;
}
Could someone with a GTX480 and the coding know how optimize it for the GF100 as it sems everything done so far has been to optimize it for cypress.
Could someone with a GTX480 and the coding know how optimize it for the GF100 as it sems everything done so far has been to optimize it for cypress.
P(GTX480) * P(know how) = very small number ;)
Most of the optimizations (except the vectorizing) are also valid for fermi, evergreen just benefits more from them (or you could say; needs them more). So I think it's mostly a matter of tweaking different sizes (and you more or less need a gf100 at hand for that).
But sure it would be interesting to see, to get the full picture.
This thread is already in line with our expectations that naive/generic code will usually run faster on gf100, while the tables are turned by a decent level of optimization, bringing cypress closer to it's peak rates.
P(GTX480) * P(know how) = very small number ;)
Most of the optimizations (except the vectorizing) are also valid for fermi, evergreen just benefits more from them (or you could say; needs them more). So I think it's mostly a matter of tweaking different sizes (and you more or less need a gf100 at hand for that).
But sure it would be interesting to see, to get the full picture.
This thread is already in line with our expectations that naive/generic code will usually run faster on gf100, while the tables are turned by a decent level of optimization, bringing cypress closer to it's peak rates.
Very true, I'd just like to see some scores from optimized code for both.
pcchen - no felix's original link returns a 404 error on the page with the download link. Is the latest code hosted anywhere else - link please!
You can get the original version from yakiimo's (the original author) website here:
http://www.yakiimo3d.com/2010/03/29/dx11-directcompute-buddhabrot-nebulabrot/
My modification is here:
http://sites.google.com/a/kimicat.com/hotballshive/dang-an-jia/2010_03_29_Buddhabrot_DirectCompute_mod.rar
I'm getting a 404 Not Found when trying to download. Any mirrors?
The author's original code is hosted at codeplex here
http://yakiimo3d.codeplex.com/releases/view/42716
Man from Atlantis
31-Aug-2010, 01:40
You can get the original version from yakiimo's (the original author) website here:
http://www.yakiimo3d.com/2010/03/29/dx11-directcompute-buddhabrot-nebulabrot/
http://img827.imageshack.us/img827/7880/buddhabrotdirectcompute.png
My modification is here:
http://sites.google.com/a/kimicat.com/hotballshive/dang-an-jia/2010_03_29_Buddhabrot_DirectCompute_mod.rar
http://img255.imageshack.us/img255/7880/buddhabrotdirectcompute.png
seems no differance here for GTX 460
vBulletin® v3.8.6, Copyright ©2000-2013, Jelsoft Enterprises Ltd.