DX11 direct compute mandelbrot viewer

Yes, you're right about the hardware doing what your software technique does.

In general the performance should never be worse than running the full 1024 iterations, as there's no extra code to run on the "else" side (i.e. the IF chooses to terminate, not to run other code).

GT200 theoretically has better incoherence penalties because it works on 32 pixels per batch, versus 64 on ATI (or 256 in the vector version). So on a screen where the pixels are mostly not black, GT200 should terminate faster. The question is how much faster?

The Julia set might be a better test though, as the control flow is more deeply nested.
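For reference, a minimal sketch of the branching inner loop being discussed (variable and function names are hypothetical, not from the actual shader):

```hlsl
// Escape-time iteration with an early-out branch: each pixel can stop,
// but the whole hardware batch keeps looping until every pixel in it
// has either escaped or hit the 1024-iteration cap.
int Mandelbrot(float a, float b)     // c = a + b*i
{
    float u = 0, v = 0;
    int count = 0;
    [loop]
    for (int i = 0; i < 1024; i++)
    {
        float t = u*u - v*v + a;     // z = z*z + c, real part
        v = 2*u*v + b;               // imaginary part
        u = t;
        if (u*u + v*v >= 4.0f)       // escaped: this pixel is done,
            break;                   // but the batch waits for the rest
        count++;
    }
    return count;                    // escape time
}
```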

Jawed
 
Yep, both are running in PS mode for me.

I get about 30 fps at 1920x1200 windowed for the Julia (spikes to 60 in the bits where the set disappears, drops to 25 at the most complex).

Zoomed in on an all-black section of the Mandelbrot I get 21 fps (11 with the 2048-iteration option).
 
You can get even better performance in the worst case by using conditional moves instead of branches in the inner loop. The difference is that you'll get the same framerate wherever you happen to be looking :)

Should be over 2 TFLOP/s on HD5870 I believe.
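A branchless variant along the lines suggested might look like this (a sketch with hypothetical names; the `?:` selects typically compile to conditional moves rather than branches):

```hlsl
// Fixed trip count: the escape test only gates the state updates,
// so every pixel runs the same 1024 iterations with no divergence.
int MandelbrotBranchless(float a, float b)
{
    float u = 0, v = 0, ur = 0, vr = 0;
    int counter = 1024;
    [loop]
    for (int i = 0; i < 1024; i++)
    {
        float t = u*u - v*v + a;     // z = z*z + c
        v = 2*u*v + b;
        u = t;
        bool inside = (u*u + v*v < 4.0f);
        counter -= inside ? 1 : 0;   // counts iterations spent inside
        ur = inside ? u : ur;        // latch the last in-set values
        vr = inside ? v : vr;        // for smooth colouring later
    }
    return 1024 - counter;           // escape time
}
```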
 
There already seem to be conditional moves, I don't see any branches in the inner loop.


mad r15.xyzw, r16.xyzw, r15.xyzw, r3.xyzw
add r11.xyzw, r15.xyzw, r15.xyzw
mul r15.xyzw, r11.xyzw, r11.xyzw
mad r15.xyzw, r17.xyzw, r17.xyzw, r15.xyzw
lt r12.xyzw, r15.xyzw, l(4.000000, 4.000000, 4.000000, 4.000000)
and r15.xyzw, r12.xyzw, l(-32, -32, -32, -32)
iadd r9.xyzw, r9.xyzw, r15.xyzw
movc r13.xyzw, r12.xyzw, r17.xyzw, r13.xyzw
movc r14.xyzw, r12.xyzw, r11.xyzw, r14.xyzw
mov r10.xyzw, r17.xyzw
endloop
 
There already seem to be conditional moves, I don't see any branches in the inner loop.


mad r15.xyzw, r16.xyzw, r15.xyzw, r3.xyzw
add r11.xyzw, r15.xyzw, r15.xyzw
mul r15.xyzw, r11.xyzw, r11.xyzw
mad r15.xyzw, r17.xyzw, r17.xyzw, r15.xyzw
lt r12.xyzw, r15.xyzw, l(4.000000, 4.000000, 4.000000, 4.000000)
and r15.xyzw, r12.xyzw, l(-32, -32, -32, -32)
iadd r9.xyzw, r9.xyzw, r15.xyzw
movc r13.xyzw, r12.xyzw, r17.xyzw, r13.xyzw
movc r14.xyzw, r12.xyzw, r11.xyzw, r14.xyzw
mov r10.xyzw, r17.xyzw
endloop
It's the breakc_z, I believe. Look further up the code. Essentially, change the while loop into a for loop with 1024 iterations, but still use conditional moves to update the escape time for each pixel.
 
I see what you mean, I tried something like below.
Despite the full unrolling it runs slower, 50 vs 67 fps.

Code:
[unroll]
for (int j = 0; j < 1024/UNROLL; j++)
{
    [unroll]
    for (int i = 0; i < UNROLL/2; i++)
    {
        t =    u*u + a - v*v;
        v = 2*(u*v + b);
        u =    t*t + a - v*v;
        v = 2*(t*v + b);
    }
    inside   = u*u + v*v < 4.0f;
    counter -= inside ? UNROLL : 0;
    ur       = inside ? u : ur;
    vr       = inside ? v : vr;
}
 
I see what you mean, I tried something like below.
Despite the full unrolling it runs slower, 50 vs 67 fps.
Hmm interesting. All I did was add a new counter, decrement by 16 each pass and set my do-while conditional to test if the counter is > 0. I go from 110 to 118 fps (1920x1200 resolution). I thought the gain was larger when I ran this last week.

Maybe it's the unrolling you are explicitly doing?
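The counter scheme described could be sketched like this (an assumption about the details; the escape test is kept from the original shader, and 16 matches the unroll factor mentioned):

```hlsl
// do-while capped by an explicit counter: the loop still exits early
// on escape, but the total iteration count is now bounded by simple
// counter arithmetic instead of the loop induction variable alone.
int JuliaCounted(float a, float b, float u, float v)
{
    int counter = 1024;
    [loop]
    do
    {
        [unroll]
        for (int i = 0; i < 16; i++)   // 16 unrolled steps per pass
        {
            float t = u*u - v*v + a;   // z = z*z + c
            v = 2*u*v + b;
            u = t;
        }
        counter -= 16;                 // caps total work at 1024 steps
    } while (counter > 0 && u*u + v*v < 4.0f);
    return counter;                    // escape time, to 16-step granularity
}
```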
 
Hmm interesting. All I did was add a new counter, decrement by 16 each pass and set my do-while conditional to test if the counter is > 0. I go from 110 to 118 fps (1920x1200 resolution). I thought the gain was larger when I ran this last week.

I think the gain you saw is already included in the latest uploaded version.
There I changed the inner-loop unroll from 16 to 32.
That also gave about a 10% speed improvement.
 
I think the gain you saw is already included in the latest uploaded version.
There I changed the inner-loop unroll from 16 to 32.
That also gave about a 10% speed improvement.
Ah that could be it since I am using the older shader code.
 
I've plugged the vector code into a pixel shader and looked at the assembly produced for the HD 4870 with GPU ShaderAnalyzer.

To allow comparison with OpenGL Guy's earlier post, the unroll is 16.
As can be seen the number of VLIW instructions is 52, down from 78.

The HD 4870 can make use of the free x2 via MULADD_e*2.
With this the whole loop becomes MULADDs, with the multiplication by 2 for free.

With the HD 5870 I didn't see a speedup, which seems to imply the MULADD_e*2 instruction is not used.
Is there a reason this instruction isn't used? Is it still present in the RV870 GPU?


Code:
02 LOOP_DX10 i0 FAIL_JUMP_ADDR(7) VALID_PIX 
    03 ALU_BREAK: ADDR(263) CNT(12) 
         48  x: SETNE_INT   R0.x,  R8.z,  0.0f      
             y: SETNE_INT   R0.y,  R8.y,  0.0f      
             z: SETNE_INT   R0.z,  R8.x,  0.0f      
             w: SETNE_INT   R0.w,  R8.w,  0.0f      
         49  x: AND_INT     R0.x,  R1.z,  PV48.x      
             y: AND_INT     R0.y,  R1.y,  PV48.y      
             z: AND_INT     R0.z,  R1.x,  PV48.z      
             w: AND_INT     R0.w,  R1.w,  PV48.w      
         50  y: OR_INT      R0.y,  PV49.y,  PV49.w      
             z: OR_INT      R0.z,  PV49.z,  PV49.x      
         51  x: OR_INT      R0.x,  PV50.z,  PV50.y      
         52  x: PREDNE_INT  ____,  R0.x,  0.0f      UPDATE_EXEC_MASK UPDATE_PRED 
    04 ALU: ADDR(275) CNT(122) 
         53  x: MULADD_e    R0.x,  R5.w,  R5.w,  R7.y      VEC_120 
             y: MULADD_e    R0.y,  R5.z,  R5.z,  R7.x      
             z: MULADD_e    R0.z,  R5.y,  R5.y,  R7.y      VEC_210 
             w: MULADD_e    R0.w,  R5.x,  R5.x,  R7.x      VEC_102 
             t: MULADD_e*2  R1.z,  R5.y,  R4.y,  R9.x      
         54  x: MULADD_e    R1.x, -R4.w,  R4.w,  PV53.x      
             y: MULADD_e    R1.y, -R4.z,  R4.z,  PV53.y      
             z: MULADD_e    R0.z, -R4.y,  R4.y,  PV53.z      
             w: MULADD_e    R1.w, -R4.x,  R4.x,  PV53.w      VEC_120 
             t: MULADD_e*2  R0.w,  R5.x,  R4.x,  R9.x      
         55  x: MULADD_e*2  R0.x,  R5.w,  R4.w,  R7.z      
             y: MULADD_e*2  R0.y,  R5.z,  R4.z,  R7.z      
             z: MULADD_e    R2.z,  PV54.z,  PV54.z,  R7.y      
             w: MULADD_e    R2.w,  PV54.w,  PV54.w,  R7.x      
             t: MULADD_e    R2.y,  PV54.y,  PV54.y,  R7.x      
         56  x: MULADD_e    R2.x,  R1.x,  R1.x,  R7.y      VEC_120 
             y: MULADD_e    R2.y, -PV55.y,  PV55.y,  PS55      
             z: MULADD_e    R2.z, -R1.z,  R1.z,  PV55.z      
             w: MULADD_e    R0.w, -R0.w,  R0.w,  PV55.w      
             t: MULADD_e*2  R1.w,  R1.w,  R0.w,  R9.x      
         57  x: MULADD_e    R2.x, -R0.x,  R0.x,  PV56.x      VEC_120 
             y: MULADD_e*2  R0.y,  R1.y,  R0.y,  R7.z      VEC_120 
             z: MULADD_e*2  R0.z,  R0.z,  R1.z,  R9.x      VEC_120 
             w: MULADD_e*2  R2.w,  R1.x,  R0.x,  R7.z      VEC_210 
             t: MULADD_e    R1.z,  PV56.z,  PV56.z,  R7.y      
         58  x: MULADD_e    R0.x,  PV57.x,  PV57.x,  R7.y      
             y: MULADD_e    R1.y,  R2.y,  R2.y,  R7.x      
             z: MULADD_e    R1.z, -PV57.z,  PV57.z,  PS57      
             w: MULADD_e    R0.w,  R0.w,  R0.w,  R7.x      
             t: MULADD_e*2  R3.w,  R0.w,  R1.w,  R9.x      
         59  x: MULADD_e    R1.x, -R2.w,  R2.w,  PV58.x      
             y: MULADD_e    R1.y, -R0.y,  R0.y,  PV58.y      
             z: MULADD_e*2  R2.z,  R2.z,  R0.z,  R9.x      VEC_120 
             w: MULADD_e    R1.w, -R1.w,  R1.w,  PV58.w      VEC_120 
             t: MULADD_e*2  R2.y,  R2.y,  R0.y,  R7.z      
         60  x: MULADD_e*2  R2.x,  R2.x,  R2.w,  R7.z      
             y: MULADD_e    R3.y,  PV59.y,  PV59.y,  R7.x      
             z: MULADD_e    R3.z,  R1.z,  R1.z,  R7.y      
             w: MULADD_e    R2.w,  PV59.w,  PV59.w,  R7.x      
             t: MULADD_e    R3.x,  PV59.x,  PV59.x,  R7.y      
         61  x: MULADD_e    R0.x, -PV60.x,  PV60.x,  PS60      
             y: MULADD_e    R0.y, -R2.y,  R2.y,  PV60.y      
             z: MULADD_e    R0.z, -R2.z,  R2.z,  PV60.z      
             w: MULADD_e    R0.w, -R3.w,  R3.w,  PV60.w      
             t: MULADD_e*2  R1.w,  R1.w,  R3.w,  R9.x      
         62  x: MULADD_e*2  R1.x,  R1.x,  R2.x,  R7.z      VEC_120 
             y: MULADD_e*2  R1.y,  R1.y,  R2.y,  R7.z      VEC_120 
             z: MULADD_e*2  R1.z,  R1.z,  R2.z,  R9.x      VEC_120 
             w: MULADD_e    R2.w,  PV61.z,  PV61.z,  R7.y      VEC_120 
             t: MULADD_e    R2.x,  PV61.x,  PV61.x,  R7.y      
         63  x: MULADD_e    R2.x, -PV62.x,  PV62.x,  PS62      
             y: MULADD_e    R2.y,  R0.y,  R0.y,  R7.x      
             z: MULADD_e    R2.z, -PV62.z,  PV62.z,  PV62.w      
             w: MULADD_e    R0.w,  R0.w,  R0.w,  R7.x      
             t: MULADD_e*2  R2.w,  R0.w,  R1.w,  R9.x      
         64  x: MULADD_e*2  R0.x,  R0.y,  R1.y,  R7.z      VEC_120 
             y: MULADD_e    R2.y, -R1.y,  R1.y,  PV63.y      
             z: MULADD_e*2  R0.z,  R0.z,  R1.z,  R9.x      VEC_120 
             w: MULADD_e    R0.w, -R1.w,  R1.w,  PV63.w      
             t: MULADD_e*2  R1.x,  R0.x,  R1.x,  R7.z      
         65  x: MULADD_e    R3.x,  R2.x,  R2.x,  R7.y      VEC_120 
             y: MULADD_e    R0.y,  PV64.y,  PV64.y,  R7.x      
             z: MULADD_e    R1.z,  R2.z,  R2.z,  R7.y      
             w: MULADD_e    R0.w,  PV64.w,  PV64.w,  R7.x      
             t: MULADD_e*2  R1.w,  PV64.w,  R2.w,  R9.x      
         66  x: MULADD_e    R3.x, -R1.x,  R1.x,  PV65.x      VEC_120 
             y: MULADD_e    R1.y, -R0.x,  R0.x,  PV65.y      VEC_201 
             z: MULADD_e    R1.z, -R0.z,  R0.z,  PV65.z      
             w: MULADD_e    R2.w, -R2.w,  R2.w,  PV65.w      
             t: MULADD_e*2  R2.z,  R2.z,  R0.z,  R9.x      
         67  x: MULADD_e*2  R1.x,  R2.x,  R1.x,  R7.z      
             y: MULADD_e*2  R2.y,  R2.y,  R0.x,  R7.z      VEC_021 
             z: MULADD_e    R3.z,  PV66.z,  PV66.z,  R7.y      
             w: MULADD_e    R3.w,  PV66.x,  PV66.x,  R7.y      
         68  x: MULADD_e    R2.x, -PV67.x,  PV67.x,  PV67.w      
             y: MULADD_e    R3.y,  R1.y,  R1.y,  R7.x      
             z: MULADD_e    R3.z, -R2.z,  R2.z,  PV67.z      
             w: MULADD_e    R2.w,  R2.w,  R2.w,  R7.x      
             t: MULADD_e*2  R3.w,  R2.w,  R1.w,  R9.x      
         69  x: MULADD_e*2  R3.x,  R1.y,  R2.y,  R7.z      VEC_120 
             y: MULADD_e    R3.y, -R2.y,  R2.y,  PV68.y      
             z: MULADD_e*2  R2.z,  R1.z,  R2.z,  R9.x      VEC_120 
             w: MULADD_e    R2.w, -R1.w,  R1.w,  PV68.w      
             t: MULADD_e*2  R0.x,  R3.x,  R1.x,  R7.z      
         70  x: MULADD_e    R1.x,  R2.x,  R2.x,  R7.y      VEC_120 
             y: MULADD_e    R2.y,  PV69.y,  PV69.y,  R7.x      
             z: MULADD_e    R0.z,  R3.z,  R3.z,  R7.y      
             w: MULADD_e    R2.w,  PV69.w,  PV69.w,  R7.x      
             t: MULADD_e*2  R0.w,  PV69.w,  R3.w,  R9.x      
         71  x: MULADD_e    R1.x, -R0.x,  R0.x,  PV70.x      VEC_120 
             y: MULADD_e    R0.y, -R3.x,  R3.x,  PV70.y      VEC_201 
             z: MULADD_e    R0.z, -R2.z,  R2.z,  PV70.z      
             w: MULADD_e    R3.w, -R3.w,  R3.w,  PV70.w      
             t: MULADD_e*2  R3.z,  R3.z,  R2.z,  R9.x      
         72  x: MULADD_e*2  R0.x,  R2.x,  R0.x,  R7.z      
             y: MULADD_e*2  R3.y,  R3.y,  R3.x,  R7.z      VEC_021 
             z: MULADD_e    R1.z,  PV71.z,  PV71.z,  R7.y      
             w: MULADD_e    R1.w,  PV71.x,  PV71.x,  R7.y      
         73  x: MULADD_e    R3.x, -PV72.x,  PV72.x,  PV72.w      
             y: MULADD_e    R1.y,  R0.y,  R0.y,  R7.x      
             z: MULADD_e    R1.z, -R3.z,  R3.z,  PV72.z      
             w: MULADD_e    R1.w,  R3.w,  R3.w,  R7.x      
             t: MULADD_e*2  R3.w,  R3.w,  R0.w,  R9.x      
         74  x: MULADD_e*2  R1.x,  R0.y,  R3.y,  R7.z      VEC_120 
             y: MULADD_e    R1.y, -R3.y,  R3.y,  PV73.y      
             z: MULADD_e*2  R3.z,  R0.z,  R3.z,  R9.x      VEC_120 
             w: MULADD_e    R1.w, -R0.w,  R0.w,  PV73.w      
             t: MULADD_e*2  R0.x,  R1.x,  R0.x,  R7.z      
         75  x: MULADD_e    R2.x,  R3.x,  R3.x,  R7.y      VEC_120 
             y: MULADD_e    R3.y,  PV74.y,  PV74.y,  R7.x      
             z: MULADD_e    R0.z,  R1.z,  R1.z,  R7.y      
             w: MULADD_e    R1.w,  PV74.w,  PV74.w,  R7.x      
             t: MULADD_e*2  R0.w,  PV74.w,  R3.w,  R9.x      
         76  x: MULADD_e    R2.x, -R0.x,  R0.x,  PV75.x      VEC_120 
             y: MULADD_e    R0.y, -R1.x,  R1.x,  PV75.y      VEC_201 
             z: MULADD_e    R0.z, -R3.z,  R3.z,  PV75.z      
             w: MULADD_e    R1.w, -R3.w,  R3.w,  PV75.w      
             t: MULADD_e*2  R1.z,  R1.z,  R3.z,  R9.x      
         77  x: MULADD_e*2  R0.x,  R3.x,  R0.x,  R7.z      
             y: MULADD_e*2  R1.y,  R1.y,  R1.x,  R7.z      VEC_021 
             z: MULADD_e    R2.z,  PV76.z,  PV76.z,  R7.y      
             w: MULADD_e    R2.w,  PV76.x,  PV76.x,  R7.y      
    05 ALU: ADDR(397) CNT(102) 
         78  x: MULADD_e    R1.x, -R0.x,  R0.x,  R2.w      VEC_210 
             y: MULADD_e    R2.y,  R0.y,  R0.y,  R7.x      VEC_021 
             z: MULADD_e    R2.z, -R1.z,  R1.z,  R2.z      
             w: MULADD_e    R1.w,  R1.w,  R1.w,  R7.x      VEC_201 
             t: MULADD_e*2  R2.w,  R1.w,  R0.w,  R9.x      
         79  x: MULADD_e*2  R2.x,  R0.y,  R1.y,  R7.z      VEC_120 
             y: MULADD_e    R2.y, -R1.y,  R1.y,  PV78.y      
             z: MULADD_e*2  R1.z,  R0.z,  R1.z,  R9.x      VEC_120 
             w: MULADD_e    R1.w, -R0.w,  R0.w,  PV78.w      
             t: MULADD_e*2  R0.x,  R2.x,  R0.x,  R7.z      
         80  x: MULADD_e    R3.x,  R1.x,  R1.x,  R7.y      VEC_120 
             y: MULADD_e    R1.y,  PV79.y,  PV79.y,  R7.x      
             z: MULADD_e    R0.z,  R2.z,  R2.z,  R7.y      
             w: MULADD_e    R1.w,  PV79.w,  PV79.w,  R7.x      
             t: MULADD_e*2  R0.w,  PV79.w,  R2.w,  R9.x      
         81  x: MULADD_e    R3.x, -R0.x,  R0.x,  PV80.x      VEC_120 
             y: MULADD_e    R0.y, -R2.x,  R2.x,  PV80.y      VEC_201 
             z: MULADD_e    R0.z, -R1.z,  R1.z,  PV80.z      
             w: MULADD_e    R2.w, -R2.w,  R2.w,  PV80.w      
             t: MULADD_e*2  R2.z,  R2.z,  R1.z,  R9.x      
         82  x: MULADD_e*2  R0.x,  R1.x,  R0.x,  R7.z      
             y: MULADD_e*2  R2.y,  R2.y,  R2.x,  R7.z      VEC_021 
             z: MULADD_e    R3.z,  PV81.z,  PV81.z,  R7.y      
             w: MULADD_e    R3.w,  PV81.x,  PV81.x,  R7.y      
         83  x: MULADD_e    R2.x, -PV82.x,  PV82.x,  PV82.w      
             y: MULADD_e    R3.y,  R0.y,  R0.y,  R7.x      
             z: MULADD_e    R3.z, -R2.z,  R2.z,  PV82.z      
             w: MULADD_e    R2.w,  R2.w,  R2.w,  R7.x      
             t: MULADD_e*2  R3.w,  R2.w,  R0.w,  R9.x      
         84  x: MULADD_e*2  R3.x,  R0.y,  R2.y,  R7.z      VEC_120 
             y: MULADD_e    R3.y, -R2.y,  R2.y,  PV83.y      
             z: MULADD_e*2  R2.z,  R0.z,  R2.z,  R9.x      VEC_120 
             w: MULADD_e    R2.w, -R0.w,  R0.w,  PV83.w      
             t: MULADD_e*2  R0.x,  R3.x,  R0.x,  R7.z      
         85  x: MULADD_e    R1.x,  R2.x,  R2.x,  R7.y      VEC_120 
             y: MULADD_e    R2.y,  PV84.y,  PV84.y,  R7.x      
             z: MULADD_e    R0.z,  R3.z,  R3.z,  R7.y      
             w: MULADD_e    R2.w,  PV84.w,  PV84.w,  R7.x      
             t: MULADD_e*2  R0.w,  PV84.w,  R3.w,  R9.x      
         86  x: MULADD_e    R1.x, -R0.x,  R0.x,  PV85.x      VEC_120 
             y: MULADD_e    R0.y, -R3.x,  R3.x,  PV85.y      VEC_201 
             z: MULADD_e    R0.z, -R2.z,  R2.z,  PV85.z      
             w: MULADD_e    R3.w, -R3.w,  R3.w,  PV85.w      
             t: MULADD_e*2  R3.z,  R3.z,  R2.z,  R9.x      
         87  x: MULADD_e*2  R0.x,  R2.x,  R0.x,  R7.z      
             y: MULADD_e*2  R3.y,  R3.y,  R3.x,  R7.z      VEC_021 
             z: MULADD_e    R1.z,  PV86.z,  PV86.z,  R7.y      
             w: MULADD_e    R1.w,  PV86.x,  PV86.x,  R7.y      
         88  x: MULADD_e    R3.x, -PV87.x,  PV87.x,  PV87.w      
             y: MULADD_e    R1.y,  R0.y,  R0.y,  R7.x      
             z: MULADD_e    R1.z, -R3.z,  R3.z,  PV87.z      
             w: MULADD_e    R1.w,  R3.w,  R3.w,  R7.x      
             t: MULADD_e*2  R3.w,  R3.w,  R0.w,  R9.x      
         89  x: MULADD_e*2  R1.x,  R0.y,  R3.y,  R7.z      VEC_120 
             y: MULADD_e    R1.y, -R3.y,  R3.y,  PV88.y      
             z: MULADD_e*2  R3.z,  R0.z,  R3.z,  R9.x      VEC_120 
             w: MULADD_e    R1.w, -R0.w,  R0.w,  PV88.w      
             t: MULADD_e*2  R0.x,  R1.x,  R0.x,  R7.z      
         90  x: MULADD_e    R2.x,  R3.x,  R3.x,  R7.y      VEC_120 
             y: MULADD_e    R3.y,  PV89.y,  PV89.y,  R7.x      
             z: MULADD_e    R0.z,  R1.z,  R1.z,  R7.y      
             w: MULADD_e    R1.w,  PV89.w,  PV89.w,  R7.x      
             t: MULADD_e*2  R4.x,  PV89.w,  R3.w,  R9.x      
         91  x: MULADD_e    R5.x, -R3.w,  R3.w,  PV90.w      
             y: MULADD_e*2  R4.y,  R1.z,  R3.z,  R9.x      VEC_021 
             z: MULADD_e*2  R4.z,  R1.y,  R1.x,  R7.z      VEC_021 
             w: MUL_e       R3.w,  PS90,  PS90      
             t: MULADD_e    R5.y, -R3.z,  R3.z,  PV90.z      VEC_102 
         92  x: MUL_e       R1.x,  PV91.y,  PV91.y      
             y: MUL_e       R0.y,  PV91.z,  PV91.z      
             z: MULADD_e    R5.z, -R1.x,  R1.x,  R3.y      
             w: MULADD_e*2  R4.w,  R3.x,  R0.x,  R7.z      VEC_120 
             t: MULADD_e    R0.w,  PV91.x,  PV91.x,  PV91.w      
         93  x: MUL_e       R0.x,  PV92.w,  PV92.w      
             y: MULADD_e    R0.y,  PV92.z,  PV92.z,  PV92.y      
             z: MULADD_e    R0.z,  R5.y,  R5.y,  PV92.x      
             w: MULADD_e    R5.w, -R0.x,  R0.x,  R2.x      
             t: SETGT_DX10  R1.x,  (0x40800000, 4.0f).x,  PS92      
         94  x: MULADD_e    R0.x,  PV93.w,  PV93.w,  PV93.x      
             y: SETGT_DX10  R1.y,  (0x40800000, 4.0f).x,  PV93.z      
             z: SETGT_DX10  R1.z,  (0x40800000, 4.0f).x,  PV93.y      
             w: AND_INT     R0.w,  (0xFFFFFFF0, -1.#QNANf).y,  PS93      
             t: CNDE_INT    R6.x,  PS93,  R6.x,  R5.x      
         95  x: ADD_INT     R8.x,  R8.x,  PV94.w      
             y: AND_INT     R0.y,  (0xFFFFFFF0, -1.#QNANf).x,  PV94.z      
             z: AND_INT     R0.z,  (0xFFFFFFF0, -1.#QNANf).x,  PV94.y      
             w: SETGT_DX10  R1.w,  (0x40800000, 4.0f).y,  PV94.x      
             t: CNDE_INT    R6.y,  PV94.y,  R6.y,  R5.y      
         96  x: AND_INT     R0.x,  (0xFFFFFFF0, -1.#QNANf).x,  PV95.w      
             y: ADD_INT     R8.y,  R8.y,  PV95.z      
             z: ADD_INT     R8.z,  R8.z,  PV95.y      
             t: CNDE_INT    R10.x,  R1.x,  R10.x,  R4.x      
         97  y: CNDE_INT    R10.y,  R1.y,  R10.y,  R4.y      
             z: CNDE_INT    R6.z,  R1.z,  R6.z,  R5.z      
             w: ADD_INT     R8.w,  R8.w,  PV96.x      
         98  z: CNDE_INT    R10.z,  R1.z,  R10.z,  R4.z      
             w: CNDE_INT    R6.w,  R1.w,  R6.w,  R5.w      
         99  w: CNDE_INT    R10.w,  R1.w,  R10.w,  R4.w      
06 ENDLOOP i0 PASS_JUMP_ADDR(3)
 
Windows 7 support?

You mentioned Vista. Does this also run on Windows 7 DX11?

Note: it doesn't work for me. I just get a black screen and can press Escape to exit.
 
I also got that, but that was because of the missing double support on the HD 5770.
But after changing the doubles back to float, I instead get a really hard lockup after a few seconds of nothing on screen: no VPU recover, just the reset button.
Btw. the CS version of the 4D Julia is running noticeably slower than the PS version (it looks like a straight port, but is still a good deal slower).
 
I also got that, but that was because of the missing double support on the HD 5770.
But after changing the doubles back to float, I instead get a really hard lockup after a few seconds of nothing on screen: no VPU recover, just the reset button.
Btw. the CS version of the 4D Julia is running noticeably slower than the PS version (it looks like a straight port, but is still a good deal slower).

So I would suggest trying Alt+Tab, as OpenGL Guy discovered.
I don't plan to upgrade to Win 7 yet; until then, maybe some good soul can find the needed source code modification.
Maybe it is //sd.Flags = DXGI_SWAP_CHAIN_FLAG_ALLOW_MODE_SWITCH; on line 298 that should be uncommented.

Regarding the doubles, I didn't think of that; I should check for this optional Shader Model 5 feature.
There will be an updated version with full support for doubles, allowing deep zooming, but that will have to wait until the next SDK gets released and fixed.
 
Btw. the CS version of the 4D Julia is running noticeably slower than the PS version (it looks like a straight port, but is still a good deal slower).

I noted that too, see the other thread. Maybe the thread allocation is more optimal with the PS version.

The slightly annoying thing with compute shaders, I think, is that you have to define the number of threads per thread group manually. It's a bit of trial and error until you get it right, and you don't even know whether it will be right on another GPU. With pixel shaders you don't bother with that; the driver does some automatic thread allocation, probably based on the number of registers the shader needs.
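For context, the thread-group size a compute shader must declare up front looks like this (the 8x8 group size and the names here are assumptions for illustration, not the app's actual choices):

```hlsl
// The [numthreads] attribute fixes the group size at compile time;
// choosing a good value per GPU is the trial and error mentioned above.
RWTexture2D<float4> Output : register(u0);   // hypothetical output UAV

[numthreads(8, 8, 1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
    // One thread per pixel; the host side would call
    // Dispatch(width/8, height/8, 1) to cover the whole image.
    Output[id.xy] = float4(0, 0, 0, 1);      // per-pixel work goes here
}
```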
 
I noted that too, see the other thread. Maybe the thread allocation is more optimal with the PS version.

The slightly annoying thing with compute shaders, I think, is that you have to define the number of threads per thread group manually. It's a bit of trial and error until you get it right, and you don't even know whether it will be right on another GPU. With pixel shaders you don't bother with that; the driver does some automatic thread allocation, probably based on the number of registers the shader needs.

This is something I think CS should support. In OpenCL, the work-group size can be decided by the implementation: if you want to run a computation over one million numbers, you can just ask for one million work items, and the implementation automatically decides how many work items each work group should have. In that sense it's very similar to a pixel shader.
 
Happens to me as well. Just alt-tab after the app loads.

Unfortunately that doesn't help. The screen remains black when using Alt-Tab.

By pressing Ctrl-Alt-Del I was able to see that there was in fact a hidden "Mandel.exe has stopped working" message.

Do you have it working on Windows 7?
 