Old 07-Oct-2009, 20:35   #1
Voxilla
Member
 
Join Date: Jun 2007
Posts: 218
DX11 DirectCompute Mandelbrot viewer

Here is a quite fast Mandelbrot viewer that makes use of DX11 and the DirectCompute API.

The set is calculated with up to 1024 iterations. Making use of the horsepower of DX11 GPUs enables real-time panning and zooming even at high resolution.
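
As a point of reference, the escape-time calculation behind such a viewer can be sketched in a few lines of Python. This is not the shader from the download: the function name is made up for illustration, and only the 1024-iteration cap comes from the post.

```python
MAX_ITER = 1024  # iteration cap quoted in the post


def escape_time(c: complex, max_iter: int = MAX_ITER) -> int:
    """Return how many iterations z -> z*z + c runs before |z| exceeds 2."""
    z = 0j
    for i in range(max_iter):
        # |z|^2 > 4 is the same test as |z| > 2 and avoids a square root
        if z.real * z.real + z.imag * z.imag > 4.0:
            return i
        z = z * z + c
    return max_iter  # never escaped: treat the point as inside the set


print(escape_time(0j))      # a point inside the set runs to the cap
print(escape_time(2 + 2j))  # a point far outside escapes almost at once
```

The iteration count per pixel is what gets mapped to a color, so the cost of a frame is dominated by pixels that run all the way to the cap.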

Two versions are included: a scalar one and a vectorized one.
Both generate the same output. I wrote the vectorized version after seeing suboptimal performance on the ATI HD 5870 with the scalar calculation.
Compared to the scalar version it runs twice as fast on this GPU. Here we see the downside of a non-scalar GPU architecture.
On the forthcoming scalar Nvidia DX11 GPUs the scalar version will probably run faster. Writing a vectorized compute shader is substantially more complicated.
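
To make the scalar/vectorized distinction concrete, here is a rough Python model of one plausible way to vectorize the loop: four points are iterated in lockstep, each with its own "still active" flag. The function name and the four-wide packing are assumptions for illustration, not taken from the included source.

```python
def escape_time_x4(cs, max_iter=1024):
    """Iterate four points in lockstep and return their four iteration counts."""
    assert len(cs) == 4
    zr = [0.0] * 4
    zi = [0.0] * 4
    counts = [0] * 4
    active = [True] * 4          # per-lane "still iterating" mask
    for _ in range(max_iter):
        if not any(active):
            break                # the loop ends only when all four lanes are done
        for k in range(4):       # on a GPU these four lanes would be one vector op
            if not active[k]:
                continue         # stand-in for a masked (conditional) write
            r, i = zr[k], zi[k]
            zr[k] = r * r - i * i + cs[k].real
            zi[k] = 2.0 * r * i + cs[k].imag
            counts[k] += 1
            if zr[k] * zr[k] + zi[k] * zi[k] > 4.0:
                active[k] = False
    return counts


print(escape_time_x4([0j, 2 + 2j, -1 + 0j, 1 + 0j]))
```

The price of this layout is that a lane which escapes early still waits for the slowest of its three neighbours, which is part of why the vectorized shader is harder to write well.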

On the HD 5870, computational throughput is well over 1.5 TFLOP/s.
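
For context, that figure can be compared with the card's theoretical peak. The sketch below assumes the published HD 5870 specifications (1600 stream processors at 850 MHz) and counts a multiply-add as two floating-point operations; how the 1.5 TFLOP/s itself was measured is not stated in the post.

```python
STREAM_PROCESSORS = 1600       # HD 5870: 320 five-wide units (published spec)
CORE_CLOCK_HZ = 850e6          # published core clock
FLOPS_PER_LANE_PER_CLOCK = 2   # one multiply-add counted as two operations

peak = STREAM_PROCESSORS * CORE_CLOCK_HZ * FLOPS_PER_LANE_PER_CLOCK
measured = 1.5e12              # throughput quoted in the post

print(f"theoretical peak: {peak / 1e12:.2f} TFLOP/s")
print(f"fraction of peak: {measured / peak:.0%}")
```

Under these assumptions the viewer reaches a little over half of the theoretical peak.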

Full source code is included.
Note that no drawing code was needed: it is possible to write directly to the back buffer from the compute shader.

Pressing the space bar switches the calculation between scalar and vectorized.
Zoom in and out with the left and right mouse buttons.
Pan by moving the mouse. Panning stops while the mouse is inside an invisible, quarter-screen-sized circle centered on the screen.
The A and Z keys cycle the colors.
ALT + Enter switches to full screen.
Old 07-Oct-2009, 21:00   #2
digitalwanderer
Dangerously Mirthful
 
Join Date: Feb 2002
Location: Highland, IN USA
Posts: 14,599

It's failing on start for me, an error message saying it's stopped working comes up as soon as it fires up a window.

Probably me doing something stupid, any idea what?

Oh, my system: Phenom II X4 965 BE, Gigabyte MA790FXT-UDP5, 2x2GB Corsair XMS3, 1GB ATI Radeon 5870, Corsair 620hx, 750GB HD Caviar Black
Old 07-Oct-2009, 21:13   #3
Humus
Crazy coder
 
Join Date: Feb 2002
Location: Stockholm, Sweden
Posts: 3,134

Quote:
Originally Posted by Voxilla View Post
Here is a quite fast Mandelbrot viewer that makes use of DX11 and the DirectCompute API.
Doesn't run on DX10 cards, I presume?

Edit: Judging by the source code, it appears not. Looking at the shader, I think it would be fairly easy to make it run on DX10 cards as well.
__________________
[ Visit my site ]
I speak for myself and only myself.

Last edited by Humus; 07-Oct-2009 at 21:18.
Old 07-Oct-2009, 21:22   #4
Voxilla
Member
 
Join Date: Jun 2007
Posts: 218

Quote:
Originally Posted by digitalwanderer View Post
It's failing on start for me, an error message saying it's stopped working comes up as soon as it fires up a window.

Probably me doing something stupid, any idea what?
Can you run the DX11 demos from the DX SDK?
On Vista it requires the DX11 beta to be installed.
Old 07-Oct-2009, 21:26   #5
Voxilla
Member
 
Join Date: Jun 2007
Posts: 218

Quote:
Originally Posted by Humus View Post
Doesn't run on DX10 cards, I presume?

Edit: Judging by the source code, it appears not. Looking at the shader, I think it would be fairly easy to make it run on DX10 cards as well.
It makes use of an unordered access view; I'm not sure whether that is supported at the DX10 feature level. For you it might be easy to get it running on DX10.
Old 07-Oct-2009, 21:26   #6
fellix
Senior Member
 
Join Date: Dec 2004
Location: Varna, Bulgaria
Posts: 1,706

There was once a 3D fractal generator written in Cg, and it was rather impressive! I would really like to see such a thing ported to DirectCompute or OpenCL.
__________________
"Releasing a game in 2010 without AA is a completely foreign concept to me. If the technique you're using makes it impossible to use AA then you're using the wrong technique." -- Humus
Old 07-Oct-2009, 21:34   #7
digitalwanderer
Dangerously Mirthful
 
Join Date: Feb 2002
Location: Highland, IN USA
Posts: 14,599

Quote:
Originally Posted by Voxilla View Post
Can you run the DX11 demos from the DX SDK?
On Vista it requires the DX11 beta to be installed.
Ooops, sorry. Windows 7 64-bit OEM, all legal and everything.
Old 07-Oct-2009, 21:58   #8
Humus
Crazy coder
 
Join Date: Feb 2002
Location: Stockholm, Sweden
Posts: 3,134

Quote:
Originally Posted by Voxilla View Post
It makes use of an unordered access view; I'm not sure whether that is supported at the DX10 feature level. For you it might be easy to get it running on DX10.
RWTexture2D is not supported, but RWBuffer is.
Old 07-Oct-2009, 23:15   #9
Sinistar
Member
 
Join Date: Aug 2004
Location: Indiana
Posts: 318

Works fine here, Vista64 updated to DX11.
Old 08-Oct-2009, 15:29   #10
Voxilla
Member
 
Join Date: Jun 2007
Posts: 218

I've just uploaded a slightly enhanced version.

The starting positions are now calculated with doubles, which results in less noise when zooming in.
Unfortunately, full double-precision calculation does not work, as the HLSL compiler currently has bugs that prevent any serious work with doubles.
I could not get the scalar version working with double start positions, so the two versions now render differently.
There are also some cosmetic changes, as you will notice.
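
Why the precision of the start positions matters can be illustrated with a small experiment. The sketch below computes the x coordinate of 64 neighbouring pixels once with every step rounded to single precision and once in double precision. The view centre and pixel width are hypothetical values chosen for a deep zoom, and the post does not say how the shader consumes the double results afterwards.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def pixel_x_f32(center: float, step: float, px: int) -> float:
    # every intermediate result is rounded to single precision
    offset = to_f32(to_f32(step) * to_f32(float(px)))
    return to_f32(to_f32(center) + offset)


def pixel_x_f64(center: float, step: float, px: int) -> float:
    return center + step * px


center = -0.743643887  # hypothetical view centre
step = 1e-9            # hypothetical width of one pixel at deep zoom

f32_columns = {pixel_x_f32(center, step, px) for px in range(64)}
f64_columns = {pixel_x_f64(center, step, px) for px in range(64)}
print(len(f32_columns), "distinct single-precision columns")
print(len(f64_columns), "distinct double-precision columns")
```

At this zoom level the single-precision coordinates collapse onto a handful of values, while the double-precision ones remain distinct for every pixel.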
Old 08-Oct-2009, 21:29   #11
OpenGL guy
Senior Member
 
Join Date: Feb 2002
Posts: 2,146

Neat demo! Having done my Ph.D. in mathematics on fractal geometry, it was great to see this running so fast! My friends and I used to generate Mandelbrot pictures on our Apple IIs back in high school, so things have certainly come a long way since then.

I took the liberty of compiling the shaders and I extracted the main loop from the vectorized version. Here it is:
Code:
04 LOOP_DX10 i0 FAIL_JUMP_ADDR(11) 
    05 ALU_BREAK: ADDR(470) CNT(16) 
         88  x: MOV         R11.x,  R0.z      
             y: MOV         R11.y,  R0.w      
             z: MOV         R5.z,  R0.x      
             w: MOV         R9.w,  R0.y      
             t: OR_INT      T0.w,  R2.x,  R1.x      
         89  x: SETNE_INT   ____,  PV88.y,  0.0f      
             y: SETNE_INT   ____,  PV88.x,  0.0f      
             z: SETNE_INT   ____,  PV88.z,  0.0f      
             w: SETNE_INT   ____,  PV88.w,  0.0f      
             t: OR_INT      ____,  R2.y,  R1.y      
         90  x: AND_INT     ____,  PV89.x,  PV89.w      
             y: AND_INT     ____,  PV89.y,  PV89.z      
             z: OR_INT      T0.z,  T0.w,  PS89      
         91  z: AND_INT     ____,  PV90.y,  PV90.x      
         92  z: AND_INT     R1.z,  PV91.z,  T0.z      
         93  x: PREDNE_INT  ____,  R1.z,  0.0f      UPDATE_EXEC_MASK UPDATE_PRED 
    06 ALU: ADDR(486) CNT(124) 
         94  x: MUL_e       ____,  R6.y,  R6.y      
             y: MUL_e       ____,  R6.x,  R6.x      
             z: MUL_e       ____,  R6.w,  R6.w      
             w: MUL_e       ____,  R6.z,  R6.z      
             t: MUL_e       T0.y,  R4.x,  R6.x      
         95  x: MULADD_e    ____,  R4.y,  R4.y, -PV94.x      
             y: MULADD_e    ____,  R4.x,  R4.x, -PV94.y      
             z: MULADD_e    ____,  R4.w,  R4.w, -PV94.z      
             w: MULADD_e    ____,  R4.z,  R4.z, -PV94.w      
             t: MUL_e       T0.x,  R4.y,  R6.y      
         96  x: ADD         T1.x,  R5.y,  PV95.x      
             y: ADD         T1.y,  R3.x,  PV95.y      
             z: ADD         T0.z,  R5.y,  PV95.z      
             w: ADD         T0.w,  R3.x,  PV95.w      
             t: MUL_e       ____,  R4.z,  R6.z      
         97  x: MULADD_e    T0.x,  T0.x,  (0x40000000, 2.0f).x,  R10.x      
             y: MULADD_e    ____,  T0.y,  (0x40000000, 2.0f).x,  R10.x      
             z: MUL_e       ____,  R4.w,  R6.w      
             w: MULADD_e    T1.w,  PS96,  (0x40000000, 2.0f).x,  R10.y      
         98  x: MUL_e       ____,  PV97.x,  PV97.x      
             y: MUL_e       ____,  PV97.y,  PV97.y      
             z: MULADD_e    T1.z,  PV97.z,  (0x40000000, 2.0f).x,  R10.y      VEC_021 
             w: MUL_e       ____,  PV97.w,  PV97.w      
             t: MUL_e       T0.y,  T1.y,  PV97.y      
         99  x: MULADD_e    ____,  T1.x,  T1.x, -PV98.x      
             y: MULADD_e    ____,  T1.y,  T1.y, -PV98.y      
             z: MUL_e       ____,  PV98.z,  PV98.z      
             w: MULADD_e    ____,  T0.w,  T0.w, -PV98.w      
             t: MUL_e       T0.x,  T1.x,  T0.x      
        100  x: ADD         T1.x,  R5.y,  PV99.x      
             y: ADD         T1.y,  R3.x,  PV99.y      
             z: MULADD_e    ____,  T0.z,  T0.z, -PV99.z      
             w: ADD         T1.w,  R3.x,  PV99.w      
             t: MUL_e       ____,  T0.w,  T1.w      
        101  x: MUL_e       ____,  T0.z,  T1.z      
             y: MULADD_e    ____,  T0.y,  (0x40000000, 2.0f).x,  R10.x      
             z: ADD         T0.z,  R5.y,  PV100.z      VEC_120 
             w: MULADD_e    T0.w,  T0.x,  (0x40000000, 2.0f).x,  R10.x      
             t: MULADD_e    T2.w,  PS100,  (0x40000000, 2.0f).x,  R10.y      VEC_021 
        102  x: MUL_e       ____,  PV101.w,  PV101.w      
             y: MUL_e       ____,  PV101.y,  PV101.y      
             z: MULADD_e    T1.z,  PV101.x,  (0x40000000, 2.0f).x,  R10.y      VEC_021 
             w: MUL_e       ____,  PS101,  PS101      
             t: MUL_e       T0.y,  T1.y,  PV101.y      
        103  x: MULADD_e    ____,  T1.x,  T1.x, -PV102.x      
             y: MULADD_e    ____,  T1.y,  T1.y, -PV102.y      
             z: MUL_e       ____,  PV102.z,  PV102.z      
             w: MULADD_e    ____,  T1.w,  T1.w, -PV102.w      
             t: MUL_e       T1.x,  T1.x,  T0.w      
        104  x: ADD         T0.x,  R5.y,  PV103.x      
             y: ADD         T1.y,  R3.x,  PV103.y      
             z: MULADD_e    ____,  T0.z,  T0.z, -PV103.z      
             w: ADD         T2.w,  R3.x,  PV103.w      
             t: MUL_e       ____,  T1.w,  T2.w      
        105  x: MUL_e       ____,  T0.z,  T1.z      
             y: MULADD_e    ____,  T0.y,  (0x40000000, 2.0f).x,  R10.x      
             z: ADD         T0.z,  R5.y,  PV104.z      VEC_120 
             w: MULADD_e    T1.w,  T1.x,  (0x40000000, 2.0f).x,  R10.x      
             t: MULADD_e    T0.w,  PS104,  (0x40000000, 2.0f).x,  R10.y      VEC_021 
        106  x: MUL_e       ____,  PV105.w,  PV105.w      
             y: MUL_e       ____,  PV105.y,  PV105.y      
             z: MULADD_e    T1.z,  PV105.x,  (0x40000000, 2.0f).x,  R10.y      VEC_021 
             w: MUL_e       ____,  PS105,  PS105      
             t: MUL_e       T0.y,  T1.y,  PV105.y      
        107  x: MULADD_e    ____,  T0.x,  T0.x, -PV106.x      
             y: MULADD_e    ____,  T1.y,  T1.y, -PV106.y      
             z: MUL_e       ____,  PV106.z,  PV106.z      
             w: MULADD_e    ____,  T2.w,  T2.w, -PV106.w      
             t: MUL_e       T0.x,  T0.x,  T1.w      
        108  x: ADD         T1.x,  R5.y,  PV107.x      
             y: ADD         T1.y,  R3.x,  PV107.y      
             z: MULADD_e    ____,  T0.z,  T0.z, -PV107.z      
             w: ADD         T2.w,  R3.x,  PV107.w      
             t: MUL_e       ____,  T2.w,  T0.w      
        109  x: MUL_e       ____,  T0.z,  T1.z      
             y: MULADD_e    ____,  T0.y,  (0x40000000, 2.0f).x,  R10.x      
             z: ADD         T0.z,  R5.y,  PV108.z      VEC_120 
             w: MULADD_e    T0.w,  T0.x,  (0x40000000, 2.0f).x,  R10.x      
             t: MULADD_e    T1.w,  PS108,  (0x40000000, 2.0f).x,  R10.y      VEC_021 
        110  x: MUL_e       ____,  PV109.w,  PV109.w      
             y: MUL_e       ____,  PV109.y,  PV109.y      
             z: MULADD_e    T1.z,  PV109.x,  (0x40000000, 2.0f).x,  R10.y      VEC_021 
             w: MUL_e       ____,  PS109,  PS109      
             t: MUL_e       T0.y,  T1.y,  PV109.y      
        111  x: MULADD_e    ____,  T1.x,  T1.x, -PV110.x      
             y: MULADD_e    ____,  T1.y,  T1.y, -PV110.y      
             z: MUL_e       ____,  PV110.z,  PV110.z      
             w: MULADD_e    ____,  T2.w,  T2.w, -PV110.w      
             t: MUL_e       T1.x,  T1.x,  T0.w      
        112  x: ADD         T0.x,  R5.y,  PV111.x      
             y: ADD         T1.y,  R3.x,  PV111.y      
             z: MULADD_e    ____,  T0.z,  T0.z, -PV111.z      
             w: ADD         T2.w,  R3.x,  PV111.w      
             t: MUL_e       ____,  T2.w,  T1.w      
        113  x: MUL_e       ____,  T0.z,  T1.z      
             y: MULADD_e    ____,  T0.y,  (0x40000000, 2.0f).x,  R10.x      
             z: ADD         R1.z,  R5.y,  PV112.z      VEC_120 
             w: MULADD_e    T1.w,  T1.x,  (0x40000000, 2.0f).x,  R10.x      
             t: MULADD_e    T0.w,  PS112,  (0x40000000, 2.0f).x,  R10.y      VEC_021 
        114  x: MUL_e       ____,  PV113.w,  PV113.w      
             y: MUL_e       ____,  PV113.y,  PV113.y      
             z: MULADD_e    R2.z,  PV113.x,  (0x40000000, 2.0f).x,  R10.y      VEC_021 
             w: MUL_e       ____,  PS113,  PS113      
             t: MUL_e       R0.y,  T1.y,  PV113.y      
        115  x: MULADD_e    ____,  T0.x,  T0.x, -PV114.x      
             y: MULADD_e    ____,  T1.y,  T1.y, -PV114.y      
             z: MUL_e       ____,  PV114.z,  PV114.z      
             w: MULADD_e    ____,  T2.w,  T2.w, -PV114.w      
             t: MUL_e       R0.x,  T0.x,  T1.w      
        116  x: ADD         R1.x,  R5.y,  PV115.x      
             y: ADD         R1.y,  R3.x,  PV115.y      
             z: MULADD_e    R0.z,  R1.z,  R1.z, -PV115.z      
             w: ADD         R1.w,  R3.x,  PV115.w      
             t: MUL_e       R0.w,  T2.w,  T0.w      
    07 ALU: ADDR(610) CNT(122) 
        117  x: MUL_e       ____,  R1.z,  R2.z      
             y: MULADD_e    ____,  R0.y,  (0x40000000, 2.0f).x,  R10.x      
             z: ADD         T0.z,  R5.y,  R0.z      VEC_120 
             w: MULADD_e    T0.w,  R0.x,  (0x40000000, 2.0f).x,  R10.x      
             t: MULADD_e    T2.w,  R0.w,  (0x40000000, 2.0f).x,  R10.y      VEC_021 
        118  x: MUL_e       ____,  PV117.w,  PV117.w      
             y: MUL_e       ____,  PV117.y,  PV117.y      
             z: MULADD_e    T1.z,  PV117.x,  (0x40000000, 2.0f).x,  R10.y      VEC_021 
             w: MUL_e       ____,  PS117,  PS117      
             t: MUL_e       T1.y,  R1.y,  PV117.y      
        119  x: MULADD_e    ____,  R1.x,  R1.x, -PV118.x      
             y: MULADD_e    ____,  R1.y,  R1.y, -PV118.y      
             z: MUL_e       ____,  PV118.z,  PV118.z      
             w: MULADD_e    ____,  R1.w,  R1.w, -PV118.w      
             t: MUL_e       T0.x,  R1.x,  T0.w      
        120  x: ADD         T1.x,  R5.y,  PV119.x      
             y: ADD         T0.y,  R3.x,  PV119.y      
             z: MULADD_e    ____,  T0.z,  T0.z, -PV119.z      
             w: ADD         T0.w,  R3.x,  PV119.w      
             t: MUL_e       ____,  R1.w,  T2.w      
        121  x: MUL_e       ____,  T0.z,  T1.z      
             y: MULADD_e    ____,  T1.y,  (0x40000000, 2.0f).x,  R10.x      
             z: ADD         T0.z,  R5.y,  PV120.z      VEC_120 
             w: MULADD_e    T2.w,  T0.x,  (0x40000000, 2.0f).x,  R10.x      
             t: MULADD_e    T1.w,  PS120,  (0x40000000, 2.0f).x,  R10.y      VEC_021 
        122  x: MUL_e       ____,  PV121.w,  PV121.w      
             y: MUL_e       ____,  PV121.y,  PV121.y      
             z: MULADD_e    T1.z,  PV121.x,  (0x40000000, 2.0f).x,  R10.y      VEC_021 
             w: MUL_e       ____,  PS121,  PS121      
             t: MUL_e       T1.y,  T0.y,  PV121.y      
        123  x: MULADD_e    ____,  T1.x,  T1.x, -PV122.x      
             y: MULADD_e    ____,  T0.y,  T0.y, -PV122.y      
             z: MUL_e       ____,  PV122.z,  PV122.z      
             w: MULADD_e    ____,  T0.w,  T0.w, -PV122.w      
             t: MUL_e       T1.x,  T1.x,  T2.w      
        124  x: ADD         T0.x,  R5.y,  PV123.x      
             y: ADD         T0.y,  R3.x,  PV123.y      
             z: MULADD_e    ____,  T0.z,  T0.z, -PV123.z      
             w: ADD         T1.w,  R3.x,  PV123.w      
             t: MUL_e       ____,  T0.w,  T1.w      
        125  x: MUL_e       ____,  T0.z,  T1.z      
             y: MULADD_e    ____,  T1.y,  (0x40000000, 2.0f).x,  R10.x      
             z: ADD         T0.z,  R5.y,  PV124.z      VEC_120 
             w: MULADD_e    T0.w,  T1.x,  (0x40000000, 2.0f).x,  R10.x      
             t: MULADD_e    T2.w,  PS124,  (0x40000000, 2.0f).x,  R10.y      VEC_021 
        126  x: MUL_e       ____,  PV125.w,  PV125.w      
             y: MUL_e       ____,  PV125.y,  PV125.y      
             z: MULADD_e    T1.z,  PV125.x,  (0x40000000, 2.0f).x,  R10.y      VEC_021 
             w: MUL_e       ____,  PS125,  PS125      
             t: MUL_e       T1.y,  T0.y,  PV125.y      
        127  x: MULADD_e    ____,  T0.x,  T0.x, -PV126.x      
             y: MULADD_e    ____,  T0.y,  T0.y, -PV126.y      
             z: MUL_e       ____,  PV126.z,  PV126.z      
             w: MULADD_e    ____,  T1.w,  T1.w, -PV126.w      
             t: MUL_e       T0.x,  T0.x,  T0.w      
        128  x: ADD         T1.x,  R5.y,  PV127.x      
             y: ADD         T0.y,  R3.x,  PV127.y      
             z: MULADD_e    ____,  T0.z,  T0.z, -PV127.z      
             w: ADD         T2.w,  R3.x,  PV127.w      
             t: MUL_e       ____,  T1.w,  T2.w      
        129  x: MUL_e       ____,  T0.z,  T1.z      
             y: MULADD_e    ____,  T1.y,  (0x40000000, 2.0f).x,  R10.x      
             z: ADD         T0.z,  R5.y,  PV128.z      VEC_120 
             w: MULADD_e    T1.w,  T0.x,  (0x40000000, 2.0f).x,  R10.x      
             t: MULADD_e    T0.w,  PS128,  (0x40000000, 2.0f).x,  R10.y      VEC_021 
        130  x: MUL_e       ____,  PV129.w,  PV129.w      
             y: MUL_e       ____,  PV129.y,  PV129.y      
             z: MULADD_e    T1.z,  PV129.x,  (0x40000000, 2.0f).x,  R10.y      VEC_021 
             w: MUL_e       ____,  PS129,  PS129      
             t: MUL_e       T1.y,  T0.y,  PV129.y      
        131  x: MULADD_e    ____,  T1.x,  T1.x, -PV130.x      
             y: MULADD_e    ____,  T0.y,  T0.y, -PV130.y      
             z: MUL_e       ____,  PV130.z,  PV130.z      
             w: MULADD_e    ____,  T2.w,  T2.w, -PV130.w      
             t: MUL_e       T1.x,  T1.x,  T1.w      
        132  x: ADD         T0.x,  R5.y,  PV131.x      
             y: ADD         T0.y,  R3.x,  PV131.y      
             z: MULADD_e    ____,  T0.z,  T0.z, -PV131.z      
             w: ADD         T2.w,  R3.x,  PV131.w      
             t: MUL_e       ____,  T2.w,  T0.w      
        133  x: MUL_e       ____,  T0.z,  T1.z      
             y: MULADD_e    ____,  T1.y,  (0x40000000, 2.0f).x,  R10.x      
             z: ADD         T0.z,  R5.y,  PV132.z      VEC_120 
             w: MULADD_e    T0.w,  T1.x,  (0x40000000, 2.0f).x,  R10.x      
             t: MULADD_e    T1.w,  PS132,  (0x40000000, 2.0f).x,  R10.y      VEC_021 
        134  x: MUL_e       ____,  PV133.w,  PV133.w      
             y: MUL_e       ____,  PV133.y,  PV133.y      
             z: MULADD_e    T1.z,  PV133.x,  (0x40000000, 2.0f).x,  R10.y      VEC_021 
             w: MUL_e       ____,  PS133,  PS133      
             t: MUL_e       T1.y,  T0.y,  PV133.y      
        135  x: MULADD_e    ____,  T0.x,  T0.x, -PV134.x      
             y: MULADD_e    ____,  T0.y,  T0.y, -PV134.y      
             z: MUL_e       ____,  PV134.z,  PV134.z      
             w: MULADD_e    ____,  T2.w,  T2.w, -PV134.w      
             t: MUL_e       T0.x,  T0.x,  T0.w      
        136  x: ADD         R0.x,  R5.y,  PV135.x      
             y: ADD         R1.y,  R3.x,  PV135.y      
             z: MULADD_e    ____,  T0.z,  T0.z, -PV135.z      
             w: ADD         R2.w,  R3.x,  PV135.w      
             t: MUL_e       ____,  T2.w,  T1.w      
        137  x: MUL_e       ____,  T0.z,  T1.z      
             y: MULADD_e    ____,  T1.y,  (0x40000000, 2.0f).x,  R10.x      
             z: ADD         R0.z,  R5.y,  PV136.z      VEC_120 
             w: MULADD_e    R0.w,  T0.x,  (0x40000000, 2.0f).x,  R10.x      
             t: MULADD_e    R3.w,  PS136,  (0x40000000, 2.0f).x,  R10.y      VEC_021 
        138  x: MUL_e       R1.x,  PV137.w,  PV137.w      
             y: MUL_e       R0.y,  PV137.y,  PV137.y      
             z: MULADD_e    R1.z,  PV137.x,  (0x40000000, 2.0f).x,  R10.y      VEC_021 
             w: MUL_e       R1.w,  PS137,  PS137      
             t: MUL_e       R2.y,  R1.y,  PV137.y      
    08 ALU: ADDR(732) CNT(122) 
        139  x: MULADD_e    ____,  R0.x,  R0.x, -R1.x      VEC_021 
             y: MULADD_e    ____,  R1.y,  R1.y, -R0.y      
             z: MUL_e       ____,  R1.z,  R1.z      
             w: MULADD_e    ____,  R2.w,  R2.w, -R1.w      
             t: MUL_e       T0.x,  R0.x,  R0.w      
        140  x: ADD         T1.x,  R5.y,  PV139.x      
             y: ADD         T1.y,  R3.x,  PV139.y      
             z: MULADD_e    ____,  R0.z,  R0.z, -PV139.z      
             w: ADD         T2.w,  R3.x,  PV139.w      
             t: MUL_e       ____,  R2.w,  R3.w      
        141  x: MUL_e       ____,  R0.z,  R1.z      
             y: MULADD_e    ____,  R2.y,  (0x40000000, 2.0f).x,  R10.x      
             z: ADD         T0.z,  R5.y,  PV140.z      VEC_120 
             w: MULADD_e    T1.w,  T0.x,  (0x40000000, 2.0f).x,  R10.x      
             t: MULADD_e    T0.w,  PS140,  (0x40000000, 2.0f).x,  R10.y      VEC_021 
        142  x: MUL_e       ____,  PV141.w,  PV141.w      
             y: MUL_e       ____,  PV141.y,  PV141.y      
             z: MULADD_e    T1.z,  PV141.x,  (0x40000000, 2.0f).x,  R10.y      VEC_021 
             w: MUL_e       ____,  PS141,  PS141      
             t: MUL_e       T0.y,  T1.y,  PV141.y      
        143  x: MULADD_e    ____,  T1.x,  T1.x, -PV142.x      
             y: MULADD_e    ____,  T1.y,  T1.y, -PV142.y      
             z: MUL_e       ____,  PV142.z,  PV142.z      
             w: MULADD_e    ____,  T2.w,  T2.w, -PV142.w      
             t: MUL_e       T1.x,  T1.x,  T1.w      
        144  x: ADD         T0.x,  R5.y,  PV143.x      
             y: ADD         T1.y,  R3.x,  PV143.y      
             z: MULADD_e    ____,  T0.z,  T0.z, -PV143.z      
             w: ADD         T2.w,  R3.x,  PV143.w      
             t: MUL_e       ____,  T2.w,  T0.w      
        145  x: MUL_e       ____,  T0.z,  T1.z      
             y: MULADD_e    ____,  T0.y,  (0x40000000, 2.0f).x,  R10.x      
             z: ADD         T0.z,  R5.y,  PV144.z      VEC_120 
             w: MULADD_e    T0.w,  T1.x,  (0x40000000, 2.0f).x,  R10.x      
             t: MULADD_e    T1.w,  PS144,  (0x40000000, 2.0f).x,  R10.y      VEC_021 
        146  x: MUL_e       ____,  PV145.w,  PV145.w      
             y: MUL_e       ____,  PV145.y,  PV145.y      
             z: MULADD_e    T1.z,  PV145.x,  (0x40000000, 2.0f).x,  R10.y      VEC_021 
             w: MUL_e       ____,  PS145,  PS145      
             t: MUL_e       T0.y,  T1.y,  PV145.y      
        147  x: MULADD_e    ____,  T0.x,  T0.x, -PV146.x      
             y: MULADD_e    ____,  T1.y,  T1.y, -PV146.y      
             z: MUL_e       ____,  PV146.z,  PV146.z      
             w: MULADD_e    ____,  T2.w,  T2.w, -PV146.w      
             t: MUL_e       T0.x,  T0.x,  T0.w      
        148  x: ADD         T1.x,  R5.y,  PV147.x      
             y: ADD         T1.y,  R3.x,  PV147.y      
             z: MULADD_e    ____,  T0.z,  T0.z, -PV147.z      
             w: ADD         T2.w,  R3.x,  PV147.w      
             t: MUL_e       ____,  T2.w,  T1.w      
        149  x: MUL_e       ____,  T0.z,  T1.z      
             y: MULADD_e    ____,  T0.y,  (0x40000000, 2.0f).x,  R10.x      
             z: ADD         T0.z,  R5.y,  PV148.z      VEC_120 
             w: MULADD_e    T1.w,  T0.x,  (0x40000000, 2.0f).x,  R10.x      
             t: MULADD_e    T0.w,  PS148,  (0x40000000, 2.0f).x,  R10.y      VEC_021 
        150  x: MUL_e       ____,  PV149.w,  PV149.w      
             y: MUL_e       ____,  PV149.y,  PV149.y      
             z: MULADD_e    T1.z,  PV149.x,  (0x40000000, 2.0f).x,  R10.y      VEC_021 
             w: MUL_e       ____,  PS149,  PS149      
             t: MUL_e       T0.y,  T1.y,  PV149.y      
        151  x: MULADD_e    ____,  T1.x,  T1.x, -PV150.x      
             y: MULADD_e    ____,  T1.y,  T1.y, -PV150.y      
             z: MUL_e       ____,  PV150.z,  PV150.z      
             w: MULADD_e    ____,  T2.w,  T2.w, -PV150.w      
             t: MUL_e       T1.x,  T1.x,  T1.w      
        152  x: ADD         T0.x,  R5.y,  PV151.x      
             y: ADD         T1.y,  R3.x,  PV151.y      
             z: MULADD_e    ____,  T0.z,  T0.z, -PV151.z      
             w: ADD         T2.w,  R3.x,  PV151.w      
             t: MUL_e       ____,  T2.w,  T0.w      
        153  x: MUL_e       ____,  T0.z,  T1.z      
             y: MULADD_e    ____,  T0.y,  (0x40000000, 2.0f).x,  R10.x      
             z: ADD         T0.z,  R5.y,  PV152.z      VEC_120 
             w: MULADD_e    T0.w,  T1.x,  (0x40000000, 2.0f).x,  R10.x      
             t: MULADD_e    T1.w,  PS152,  (0x40000000, 2.0f).x,  R10.y      VEC_021 
        154  x: MUL_e       ____,  PV153.w,  PV153.w      
             y: MUL_e       ____,  PV153.y,  PV153.y      
             z: MULADD_e    T1.z,  PV153.x,  (0x40000000, 2.0f).x,  R10.y      VEC_021 
             w: MUL_e       ____,  PS153,  PS153      
             t: MUL_e       T0.y,  T1.y,  PV153.y      
        155  x: MULADD_e    ____,  T0.x,  T0.x, -PV154.x      
             y: MULADD_e    ____,  T1.y,  T1.y, -PV154.y      
             z: MUL_e       ____,  PV154.z,  PV154.z      
             w: MULADD_e    ____,  T2.w,  T2.w, -PV154.w      
             t: MUL_e       T0.x,  T0.x,  T0.w      
        156  x: ADD         R4.x,  R3.x,  PV155.y      
             y: ADD         R4.y,  R5.y,  PV155.x      
             z: MULADD_e    ____,  T0.z,  T0.z, -PV155.z      
             w: MUL_e       ____,  T2.w,  T1.w      
             t: ADD         R4.z,  R3.x,  PV155.w      
        157  x: MULADD_e    R6.x,  T0.y,  (0x40000000, 2.0f).x,  R10.x      
             y: MULADD_e    R6.y,  T0.x,  (0x40000000, 2.0f).x,  R10.x      
             z: MUL_e       ____,  T0.z,  T1.z      
             w: ADD         R4.w,  R5.y,  PV156.z      VEC_120 
             t: MULADD_e    R6.z,  PV156.w,  (0x40000000, 2.0f).x,  R10.y      VEC_021 
        158  x: MUL_e       ____,  PV157.y,  PV157.y      
             y: MUL_e       ____,  PV157.x,  PV157.x      
             z: MUL_e       ____,  PS157,  PS157      
             w: MULADD_e    R6.w,  PV157.z,  (0x40000000, 2.0f).x,  R10.y      
        159  x: MULADD_e    T0.x,  R4.y,  R4.y,  PV158.x      
             y: MULADD_e    ____,  R4.x,  R4.x,  PV158.y      
             z: MUL_e       ____,  PV158.w,  PV158.w      
             w: MULADD_e    T1.w,  R4.z,  R4.z,  PV158.z      
        160  x: SETGT_DX10  R2.x,  (0x40800000, 4.0f).x,  PV159.y      
             z: MULADD_e    ____,  R4.w,  R4.w,  PV159.z      
        161  x: CNDE_INT    R8.x,  PV160.x,  R8.x,  R4.x      
             y: SETGT_DX10  R1.y,  (0x40800000, 4.0f).x,  PV160.z      
             z: SETGT_DX10  R1.z,  (0x40800000, 4.0f).x,  T0.x      VEC_102 
             w: SETGT_DX10  R1.w,  (0x40800000, 4.0f).x,  T1.w      
             t: AND_INT     R2.y,  PV160.x,  (0xFFFFFFF0, -1.#QNANf).y      
    09 ALU: ADDR(854) CNT(18) 
        162  x: CNDE_INT    R7.x,  R2.x,  R7.x,  R6.x      
             y: CNDE_INT    R8.y,  R1.z,  R8.y,  R4.y      VEC_201 
             z: CNDE_INT    R8.z,  R1.w,  R8.z,  R4.z      VEC_201 
             w: CNDE_INT    R8.w,  R1.y,  R8.w,  R4.w      VEC_201 
             t: AND_INT     T0.x,  R1.z,  (0xFFFFFFF0, -1.#QNANf).x      
        163  x: AND_INT     ____,  R1.w,  (0xFFFFFFF0, -1.#QNANf).x      VEC_201 
             y: CNDE_INT    R7.y,  R1.z,  R7.y,  R6.y      VEC_201 
             z: CNDE_INT    R7.z,  R1.w,  R7.z,  R6.z      VEC_201 
             w: CNDE_INT    R7.w,  R1.y,  R7.w,  R6.w      VEC_201 
             t: AND_INT     ____,  R1.y,  (0xFFFFFFF0, -1.#QNANf).x      
        164  x: ADD_INT     R0.x,  R5.z,  PV163.x      
             y: ADD_INT     R0.y,  R9.w,  PS163      
             z: ADD_INT     R0.z,  R11.x,  R2.y      
             w: ADD_INT     R0.w,  R11.y,  T0.x      
        165  x: MOV         R1.x,  R1.w      
             y: MOV         R2.y,  R1.z      
10 ENDLOOP i0 PASS_JUMP_ADDR(5)
As you can see, 68 of the 78 ALU instructions have all 5 slots populated. Who says the 't' unit never gets used!

BTW, Julia sets are also interesting to look at and can be generated in much the same way as the Mandelbrot set. For the Mandelbrot set you start with z = 0 and iterate z -> z^2 + C, where C is the value of the point being tested. For a Julia set you instead fix C to the value you want the Julia set for, and vary the starting value of z according to the pixel you are computing. The simplest example is J_C with C = 0, which is the unit circle. (The Julia set doesn't include the interior.) This is easy to see: for all z with |z| < 1, z -> z^2 + 0 tends to the origin, and for all z with |z| > 1 it tends to infinity.

J_i, a dendrite, is a neat one to look at if you get a chance to try it out.
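
The difference between the two sets described above comes down to which of z and C is taken from the pixel. A minimal Python sketch, with function names invented for illustration:

```python
def iterate(z: complex, c: complex, max_iter: int = 1024) -> int:
    """Run z -> z*z + c and return the iteration at which |z| first exceeds 2."""
    for i in range(max_iter):
        if abs(z) > 2.0:
            return i
        z = z * z + c
    return max_iter  # stayed bounded for the whole run


def mandelbrot(pixel: complex) -> int:
    return iterate(0j, pixel)   # z starts at 0, C comes from the pixel


def julia(pixel: complex, c: complex) -> int:
    return iterate(pixel, c)    # z comes from the pixel, C is fixed


# C = 0: points inside the unit circle stay bounded, points outside escape
print(julia(0.5 + 0j, 0j), julia(1.5 + 0j, 0j))
# C = i: the orbit of 0 is 0, i, -1+i, -i, -1+i, ... and never escapes
print(mandelbrot(1j))
```

Because only the roles of the two inputs change, a Mandelbrot shader can be turned into a Julia shader without touching the inner loop.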
__________________
I speak only for myself.
Old 09-Oct-2009, 11:03   #12
Voxilla
Member
 
Join Date: Jun 2007
Posts: 218

Interesting to see that so many slots get used; I thought everything would end up in xyzw.
BTW, which tool did you use to get the assembly? ShaderAnalyzer doesn't seem to be able to do this yet.

The Julia sets are interesting indeed.
I've just uploaded an extended version that can now also do Julia calculations; thanks for the suggestion.
There is more explanation at the updated link.
Old 09-Oct-2009, 12:40   #13
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 2,202

It'd be interesting to see the code generated by AMD's compiler for the scalar version. I'd like to know how good it is wrt extracting ILP from predominantly scalar code.
__________________
The views presented here are my own and do not represent my present or past employers' views in any way.
My blog
Eigen : simd done right
Old 09-Oct-2009, 15:47   #14
Voxilla
Member
 
Join Date: Jun 2007
Posts: 218

Having looked at the assembly output a second time, I noticed quite a few loose MUL and ADD instructions where a single MULADD instruction would do.

The reason appears to be that t = u*u - v*v + a is compiled in the written order.
This results in a MUL, MULADD, ADD sequence.

Reordering the expression to t = u*u + a - v*v gives a 20% speedup.
So I assume this now gets compiled into MULADD, MULADD, which is one instruction less.
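
The instruction counts above can be reproduced with a toy model in Python. This is only a sketch under stated assumptions: it is not the AMD compiler's algorithm, it assumes strict left-to-right evaluation of a sum of terms, and it assumes a product can be fused into a MULADD only with the addition it directly takes part in. The function name and the "product"/"scalar" labels are made up for illustration.

```python
def instruction_sequence(terms):
    """Toy model: list the instructions for a left-to-right sum of terms.

    Each term is either "product" (like u*u) or "scalar" (like a).
    """
    ops = []
    pending_product = False  # a leading product that has not been emitted yet
    have_value = False       # a computed running sum is available
    for term in terms:
        is_product = term == "product"
        if not have_value and not pending_product:
            # First term: a product waits to see whether it can be fused.
            if is_product:
                pending_product = True
            else:
                have_value = True
        elif pending_product:
            if is_product:
                # product + product: one side needs its own MUL first.
                ops += ["MUL", "MULADD"]
            else:
                # product + scalar fuses into a single MULADD.
                ops.append("MULADD")
            pending_product = False
            have_value = True
        else:
            # running sum + product fuses; running sum + scalar is a plain ADD.
            ops.append("MULADD" if is_product else "ADD")
    if pending_product:
        ops.append("MUL")
    return ops


written_order = instruction_sequence(["product", "product", "scalar"])  # u*u - v*v + a
reordered = instruction_sequence(["product", "scalar", "product"])      # u*u + a - v*v
```

Under these assumptions the written order yields MUL, MULADD, ADD and the reordered form yields MULADD, MULADD.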

Computational throughput is now well over 1.7 TFLOP/s!

I've updated the viewer to version 1.4.

The scalar version is now about 3 times slower than the vectorized version.
This can be seen by zooming in to a fully black view while in windowed mode, so that the frames per second are visible.
Voxilla is offline   Reply With Quote
Old 09-Oct-2009, 17:50   #15
OpenGL guy
Senior Member
 
Join Date: Feb 2002
Posts: 2,146
Send a message via ICQ to OpenGL guy
Default

Quote:
Originally Posted by Voxilla
Interesting to see that so many slots get used, I thought all would end up in xyzw.
BTW with what tool did you get the assembly as ShaderAnalyser doesn't seem to be able to do this yet ?
Being an AMD employee, I have access to such tools.
Quote:
Originally Posted by Voxilla
Having looked a second time at the assembly output, I noticed quite a few loose MUL and ADD instructions, where in fact this could be done with a single MULADD instruction.

The reason appears to be that t = u*u - v*v + a is compiled in the written order.
This results in a MUL MULADD and ADD sequence.

Reordering the instructions to t = u*u + a - v*v gives a 20% speedup.
So I assume now this gets compiled into MULADD, MULADD, so one instruction less.

Computational throughput is now well over 1.7 TFLOP/s !
I believe that left-to-right evaluation is the default when there are multiple operators with the same precedence. Your reordering of the calculations had a nice effect on the assembly output. The inner loop is only 64 ALU slots now and is mostly MADs.
__________________
I speak only for myself.
OpenGL guy is offline   Reply With Quote
Old 09-Oct-2009, 19:13   #16
Voxilla
Member
 
Join Date: Jun 2007
Posts: 218
Default

After reading the document about the R700 instruction set I got another idea to speed up the calculation.
There is an instruction MULADD_IEEE_M2 which does dst = (src0*src1 + src2)*2.

This could be used for calculating

v = 2*u*v + b

after rewriting it as

v = 2*(u*v + b') with b'=b/2

However, this seemed to have no effect; had it worked, it would have saved another MUL instruction per iteration.
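
As a quick check of the algebra behind this rewrite, here is a small Python sketch. It only compares the two formulas numerically and does not model the GPU instruction itself. The function names and the precomputed `half_b` parameter are assumptions for illustration.

```python
def step_plain(u, v, b):
    # v = 2*u*v + b: the doubling costs a separate multiply.
    return 2.0 * u * v + b


def step_m2(u, v, half_b):
    # v = 2*(u*v + b'): one multiply-add followed by the built-in "times 2",
    # with b' = b/2 computed once outside the iteration loop.
    return (u * v + half_b) * 2.0


u, v, b = 0.3, -0.7, 0.1
plain = step_plain(u, v, b)
rewritten = step_m2(u, v, b / 2.0)
```

Halving b is done once per point rather than once per iteration, which is why the rewrite would remove a multiply from the inner loop if the compiler picked it up.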
Voxilla is offline   Reply With Quote
Old 10-Oct-2009, 11:15   #17
Voxilla
Member
 
Join Date: Jun 2007
Posts: 218
Default

Here is another incremental improvement.
The shader code looks a lot cleaner now, with scalar and vector versions almost identical.
The main loop is unrolled twice more and computational throughput is now over 1.9 TFLOP/s.

At 2560x1600 resolution the frame rate never drops below 60 fps, so it can be useful to synchronize to vsync (by pressing the V key). Alternatively, the maximum number of iterations can be doubled to 2048 with the M key.

The scalar version unfortunately now runs with rendering artifacts; I have no idea what could be the cause.
I've also included the optimization mentioned in the previous post in case the AMD compiler can make use of it.
This could potentially give another 25% speed bump.

Edit: Fixed a bug that caused 32 iterations too many for non-escaping points.

Last edited by Voxilla; 10-Oct-2009 at 13:01.
Voxilla is offline   Reply With Quote
Old 10-Oct-2009, 17:47   #18
CarstenS
Just wondering
 
Join Date: May 2002
Location: Germany
Posts: 1,682
Default

Sounds great! Any chance you could do without unordered access views? That'd open up a wider target audience.

For example me, sitting here stuck with a GTX280 in "the box" and an HD 5870 waiting in its shipping box to be installed as soon as an unfortunately rather lengthy job (days, maybe another week) running on the rig has finished.
__________________
English is not my native tongue. Before being too nitpicky about my choice of words please consider the possibility that I did not mean to say what you might have read into them and inquire before flaming.
CarstenS is offline   Reply With Quote
Old 10-Oct-2009, 18:09   #19
Voxilla
Member
 
Join Date: Jun 2007
Posts: 218
Default

Quote:
Originally Posted by CarstenS
Sounds great! Any chance you could do without unordered access view? That'd open up a wider target audience
I tried to get it working with D3D_FEATURE_LEVEL_10_0, but this would not let me create the device with DXGI_USAGE_UNORDERED_ACCESS, which is needed to write directly to the backbuffer. Maybe there is another way to get the data there, by copying from another buffer. Any suggestions?

In principle most of this should be possible with pixel shaders instead of compute shaders, but then it gets complicated, especially for the vectorized version, as it outputs 4 pixels at once.

Also, I'm now using doubles to some extent, which is only possible with shader version 5.
As soon as the HLSL compiler no longer crashes, I plan to make a full doubles version to allow deeper zooming; this will probably take until the next release of the DX SDK.

PS: Maybe someone from Microsoft is reading this; I'm willing to donate this as a sample for the next SDK.

Last edited by Voxilla; 10-Oct-2009 at 18:19.
Voxilla is offline   Reply With Quote
Old 10-Oct-2009, 18:17   #20
CarstenS
Just wondering
 
Join Date: May 2002
Location: Germany
Posts: 1,682
Default

Ok, so I'll have to be patient until I can put my HD 5870 to work. Ah well, more stuff to look forward to!

Unless of course, someone comes up with a very clever solution to this.
__________________
English is not my native tongue. Before being too nitpicky about my choice of words please consider the possibility that I did not mean to say what you might have read into them and inquire before flaming.
CarstenS is offline   Reply With Quote
Old 11-Oct-2009, 14:12   #21
Humus
Crazy coder
 
Join Date: Feb 2002
Location: Stockholm, Sweden
Posts: 3,134
Send a message via ICQ to Humus Send a message via MSN to Humus
Default

Quote:
Originally Posted by Voxilla
I had a try at it to get it working with D3D_FEATURE_LEVEL_10_0 but this would not let me create the device with DXGI_USAGE_UNORDERED_ACCESS, and this is needed to be able to directly write to the backbuffer. Maybe there is another way to get the data there, via copying from another buffer, any suggestions ?
Since only RWBuffer is supported and not RWTexture, and the backbuffer obviously is a texture, you'll have to use an intermediate buffer, then use a simple pixel shader to copy from the buffer to the backbuffer.
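
The two-pass idea can be sketched on the CPU in Python. This is only an illustration of the indexing, not Direct3D code. The row-major layout `y * width + x`, the function names and the `shade` callback are assumptions, not something taken from the viewer's source.

```python
def compute_pass(width, height, shade):
    """Stand-in for the compute shader: write one value per pixel into a flat buffer."""
    buffer = [0] * (width * height)
    for y in range(height):
        for x in range(width):
            # A linear buffer has no 2D addressing, so the index is computed by hand.
            buffer[y * width + x] = shade(x, y)
    return buffer


def copy_pass(buffer, width, height):
    """Stand-in for the pixel shader: each pixel (x, y) fetches its value from the buffer."""
    return [[buffer[y * width + x] for x in range(width)] for y in range(height)]


# Hypothetical shading function so that every pixel gets a distinct value.
flat = compute_pass(4, 3, lambda x, y: x + 10 * y)
image = copy_pass(flat, 4, 3)
```

Both passes have to agree on the same width and index formula; that shared convention is the only coupling between them.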
__________________
[ Visit my site ]
I speak for myself and only myself.
Humus is offline   Reply With Quote
Old 12-Oct-2009, 04:52   #22
hoom
Senior Member
 
Join Date: Sep 2003
Posts: 1,790
Default

This sounds awesome; I hope you can get it to work on DX10/10.1.

Good to see the ATI compiler making good use of resources.

Reading the link to the Keenan Julia set, he says that was a raytraced output; is this too?
__________________
However, the above is the heart of the foreskin capacitance
hoom is offline   Reply With Quote
Old 12-Oct-2009, 08:26   #23
Voxilla
Member
 
Join Date: Jun 2007
Posts: 218
Default

Quote:
Originally Posted by hoom
This sounds awesome, I hope you can get it to work on DX10/10.1
I'm willing to get it to work on DX10; give me some time.


Quote:
Reading the link to the Keenan Julia set, he says that was a raytraced output, is this too?
Yes, it does ray tracing; the latest version can now also do self-shadowing.
As the object is a fractal, the ray tracing algorithm is very different from ray tracing polygon objects, though.
Voxilla is offline   Reply With Quote
Old 13-Oct-2009, 06:39   #24
EduardoS
Junior Member
 
Join Date: Nov 2008
Posts: 78
Default

What if you change the boolean math a bit?
Like treating inside as int or even float instead of bool?
Maybe something like this: instead of
while ( any(inside && counter!=0));
try
while ( max4(inside * counter) != 0.0f);

Just curious; it looks like there are too many ALU instructions for it...
EduardoS is offline   Reply With Quote
Old 13-Oct-2009, 06:53   #25
EduardoS
Junior Member
 
Join Date: Nov 2008
Posts: 78
Default

Can't edit?
Didn't see the obvious...
while ( dot(inside, counter) != 0.0f);
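
The three loop conditions can be compared with a small Python sketch. It is not HLSL and makes assumptions the posts do not state: `inside` holds 0.0 or 1.0 per lane and `counter` is never negative. The function names are made up for illustration. With negative counters the dot product could cancel to zero while a lane is still active.

```python
def any_active(inside, counter):
    # Mirrors any(inside && counter != 0): some lane is inside and has iterations left.
    return any(i != 0.0 and c != 0.0 for i, c in zip(inside, counter))


def max4_active(inside, counter):
    # Mirrors max4(inside * counter) != 0.0f.
    return max(i * c for i, c in zip(inside, counter)) != 0.0


def dot_active(inside, counter):
    # Mirrors dot(inside, counter) != 0.0f: a sum of non-negative products.
    return sum(i * c for i, c in zip(inside, counter)) != 0.0
```

Because every product is non-negative under these assumptions, the maximum and the sum are non-zero exactly when at least one lane is still active, so all three conditions agree.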
EduardoS is offline   Reply With Quote

All times are GMT +1. The time now is 10:24.

