DX11 Julia 4D

I see what you mean. This fractal, however, has infinite detail: the more accurately you try to render the edges, the more of them you will find. As the current rendering is only a relatively smooth approximation, it may be worth investigating.
Of course, but we're not doing infinite iterations to get that infinite detail. We're just making pretty pictures :)
The problem can be considered as rendering an isosurface that lies at a certain distance 'D' from the fractal. To make it more accurate, a more accurate intersection with the isosurface would be needed. This could be done with an iterative approach similar to high quality isosurface volume rendering, and it would also be needed for getting more accurate normals. Once this is done right, adaptive supersampling near edges could be done as you describe.
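Roughly like this, I imagine (just a sketch; DE() here stands for whatever distance estimator the renderer uses, and D is the iso-distance):

float3 IntersectIsosurface(float3 rayOrigin, float3 rayDir, float D)
{
    float tOutside = 0.0;
    float t = 0.0;
    // coarse sphere-tracing march until we step inside the D-isosurface
    for (int i = 0; i < 128; i++)
    {
        float d = DE(rayOrigin + t * rayDir);  // distance estimate to the fractal
        if (d < D)
            break;       // crossed the isosurface
        tOutside = t;
        t += d - D;      // largest step that cannot overshoot the surface
    }
    // refine the crossing with a few bisection steps, as in high quality
    // isosurface volume rendering
    for (int j = 0; j < 8; j++)
    {
        float tMid = 0.5 * (tOutside + t);
        if (DE(rayOrigin + tMid * rayDir) < D)
            t = tMid;
        else
            tOutside = tMid;
    }
    return rayOrigin + t * rayDir;
}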
Yes, the isosurface method is possible, but you can still do adaptive supersampling with this unbounded volumes raytracing method.

Now that I've thought about it further, you don't even need DX10.1 or AA support (duh, why use hardware AA if there are no polygons :oops: ). Just optionally supersample in the shader itself if either of those conditions is met, and average the samples, maybe with gamma correction.
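Something along these lines for 2x2 supersampling, say (RenderPixel() is a stand-in for the existing per-pixel ray setup and shading, and the 2.2 gamma is just an example value):

float3 SuperSample(float2 pixelCoord)
{
    float3 sum = 0;
    // take a 2x2 grid of samples inside the pixel
    for (int sy = 0; sy < 2; sy++)
    {
        for (int sx = 0; sx < 2; sx++)
        {
            float2 offset = (float2(sx, sy) + 0.5) * 0.5;
            float3 c = RenderPixel(pixelCoord + offset);
            sum += pow(c, 2.2);  // accumulate in (approximately) linear space
        }
    }
    return pow(sum * 0.25, 1.0 / 2.2);  // gamma-corrected average
}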
 
It seems people using Win7 are running into trouble with my DX11 thingies.
I have no Win7, so I can not figure out what goes wrong.

Could someone try to debug this on Win7, the source code is included ?
I had the same issue. Found that if I alt-tab then I can get the app to show up. No time to debug it, however.

One thing I did find was that if you make a shader that doesn't compile, then you can't recover from the black screen :(
 
I had the same issue. Found that if I alt-tab then I can get the app to show up. No time to debug it, however.

One thing I did find was that if you make a shader that doesn't compile, then you can't recover from the black screen :(

The issue might be with the HLSL compiler: at run time the shader is compiled with whatever HLSL compiler is available, and the Win7 compiler might have issues with my shader code. This could be worked around by precompiling the shaders.
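For example with the offline compiler from the SDK, something like this (the entry point and file names are just placeholders here):

fxc /T cs_5_0 /E CSMain /Fo QJulia4D.cso QJulia4D.hlsl

The app would then load the precompiled blob directly instead of invoking whatever compiler the machine happens to have.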
 
Doesn't work on the following:

Win7 x64 RC (build 7100)
Catalyst 9.11 OpenCL Driver 8.67-091008a-089950E-ATI
4850 512MB
 
Black screen ("QJulia4D.exe has stopped working") for me too:

Windows 7 x86 build 7600
Display Driver 8.66.6-091006a-089804E
5870 1GB

Which is a pity, because I'm dying to see this (and the Mandelbrot one) in action...
 
I've added another DLL to the download, namely D3DCompiler_42.dll.

It is a DLL that is needed to compile the shader source code.
If you have not installed the August 2009 DX SDK, its absence may be what causes the renderer to crash.
 
Excellent program!

Works perfectly fine on W7 x64 and HD5870 Cat9.11beta!

PS. A short file describing what each key does would be welcome!
 
Thread group issues...

I was going to apply an (I thought) obvious CS optimization to this, but it's not behaving as expected. What I wanted was to have each thread process a small (4x4) group of pixels, sharing most of the work between them.
However, just adding the basic 4x4 loop in the shader and also reducing the group size to 4x4 drastically increases compilation time and reduces performance.
It seems even the [loop] attribute won't save me from the insane unrolling, so to keep program size down (it seems big programs aren't as much of an issue as on R700?) I have to resort to the old tricks of hiding the loop count from the compiler...
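That trick usually looks something like this (names are made up for the sketch):

cbuffer Params
{
    uint pixelsPerThread;  // the application sets this to 16
};

[loop]
for (uint i = 0; i < pixelsPerThread; i++)
{
    // shade one pixel of the 4x4 block; because the bound comes from a
    // constant buffer, the compiler can't prove it's 16 and unroll anyway
}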

As for the performance, I think it has to do with the thread grouping, but the slowdown seems too large for that.
I don't see the hardware implications of thread groups documented, but from the concepts of LDS and GDS I expect it to work like this: each thread group runs on a single SIMD (of which I have 10), and a SIMD can't get a new group until all threads in the previous one are done (though this should be possible if LDS is not used?) - so with a few long-running threads my SIMDs will be starved towards the end of a group.
So groups of 16 or below would be really bad, and generally we want as many threads as possible in each group, i.e. the maximum of 1024 (while still having a decent number of groups, but something like a few hundred should do) - more threads than the hardware (register file) can hold shouldn't be a problem.

However, the original 16x16 version is twice the speed of a 32x32 version - why is that??

In numbers I have for my test frame:
1 pixel/thread, 16x16 threads/group: 34fps
1 pixel/thread, 32x32 threads/group: 19fps
16 pixels/thread, 16x16 threads/group, dynamic loop: 17fps
16 pixels/thread, 16x16 threads/group, unrolled loop (grrr): 17fps (so just a compilation time issue)
16 pixels/thread, 4x4 threads/group: 6fps


And what would be the right way to do things if the above isn't feasible? I imagine something like 1-pixel threads, 16+ 4x4-pixel blocks in a group, and then letting one thread from each block do the common work, saving it in the LDS and then synchronizing the whole group, as in the sketch below. However, even with the maximum possible 64 blocks (1024 threads), drastically different workloads for the 64 "initial pixels" could waste a lot of cycles in the synchronization.
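Sketched out, with illustrative sizes and a made-up CommonWork() standing in for the shared part:

groupshared float4 blockData[64];  // one slot per 4x4 pixel block

[numthreads(32, 32, 1)]            // 1024 threads = 64 blocks of 4x4
void CSMain(uint3 gtid : SV_GroupThreadID, uint3 dtid : SV_DispatchThreadID)
{
    uint block = (gtid.y / 4) * 8 + (gtid.x / 4);
    // one thread per 4x4 block computes the shared work
    if ((gtid.x & 3) == 0 && (gtid.y & 3) == 0)
        blockData[block] = CommonWork(dtid.xy);
    GroupMemoryBarrierWithGroupSync();
    // now every thread shades its own pixel using blockData[block]
}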
 
I don't see the hardware implications of thread groups documented, but from the concepts of LDS and GDS I expect it to work like this: each thread group runs on a single SIMD (of which I have 10), and a SIMD can't get a new group until all threads in the previous one are done (though this should be possible if LDS is not used?) - so with a few long-running threads my SIMDs will be starved towards the end of a group.
I think I've seen someone from AMD recommend 64 for best performance somewhere on the AMD forums - that's when dealing with Brook+ or IL.

On ATI regardless of whether LDS is used the group size is taken at face value by the hardware (for Brook+ and IL, presumably the same for CS and OpenCL). There are tricks derived from this.

So yes, it's easy to starve the SIMD with a group size of 1024. Since that's strand count, that's 16 hardware threads on ATI (1024/64). With a smaller group size there can be multiple groups in flight. Compute shader mode on ATI always places an upper bound of 16 hardware threads, i.e. the sum of all strands in all in-flight groups comes to <=1024.

You also have to take into account the register allocation. If group size is 1024, but the register allocation means only 8 hardware threads (512 strands) can be in flight, you are preventing the hardware from running two groups on the SIMD in parallel, to hide the inter-group cut-over latency. You'd be better off with a group size between 64 and 256.
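To put rough numbers on it (from memory, so treat them as illustrative rather than gospel): an Evergreen SIMD has a 256KB register file, i.e. 16384 GPRs, so keeping the full 1024 strands in flight allows at most 16384/1024 = 16 GPRs per strand. A shader needing 32 GPRs per strand halves that to 512 strands, i.e. 8 hardware threads.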

In effect you are issuing a barrier synch on the completion of each group. This synch is soft if more than one group is running on the SIMD, but with only 1 group the synch becomes hard (complete flush).

So taking your data:
  • 1 pixel/thread, 16x16 threads/group: 34fps - 256 strands per group, 16 hardware threads per SIMD, soft synch
  • 1 pixel/thread, 32x32 threads/group: 19fps - 1024 strands per group, 16 hardware threads per SIMD, hard synch
  • 16 pixels/thread, 16x16 threads/group, dynamic loop: 17fps - 256 strands per group. Without knowing the register allocation for 16 pixels per strand it's not possible to determine the number of hardware threads per SIMD or whether it's hard or soft synching
  • 16 pixels/thread, 4x4 threads/group: 6fps - 16 strands per group is running each hardware thread only 1/4 occupied. Should be a minimum of 8x8.
Jawed
 
Thanks, I think I got most of it after reading up on the strands/threads thing.

So you are saying that the scheduler will only send a whole group into flight at a time - so if there's only room for one group, it will have to finish all 16 threads before a new group can go in, while if we have just 2 groups running on the SIMD, there should always be a decent number of threads in flight. I just thought it could have more than 16 in flight (when register pressure allows). Now, what if register pressure only allows for 8 threads in flight and I'm doing a sync of all 16 threads (1024 group size)? :)

So, that more or less explains the first 2 results, and the last one also makes sense: maybe we can run 16 groups at the same time, but each group will still only run at 25% efficiency. 8x8 gives the same result as 16x16 btw, which makes sense now.
For the 3rd I think the problem is the branch granularity. If strands are selected in a linear fashion from the 2D thread id, I've got myself a branch granularity of 256x4 pixels(!), and raymarching with distance functions is probably quite bad for the branching (every pixel gets a different number of iterations, and a few get a lot). Is it well defined how strands go into threads? I imagine it could even make sense to do some special spatial mapping here, to group pixels of (predicted) similar complexity. (I have been doing a lot of pixel shader raymarching/sphere tracing before, but in a pixel shader there isn't much choice about these things.)
 
So you are saying that the scheduler will only send a whole group into flight at a time - so if there's only room for one group, it will have to finish all 16 threads before a new group can go in,
I guess the thread-local shared memory is always implicitly set up (whether used or not), creating the synch - any strand can read any part of that memory.

while if we have just 2 groups running on the SIMD, there should always be a decent number of threads in flight.
Yeah.

I just thought it could have more than 16 in flight (when register pressure allows). Now, what if register pressure only allows for 8 threads in flight and I'm doing a sync of all 16 threads (1024 group size)? :)
I honestly don't know. I'm not sure what CUDA does. I've just remembered that CS4 can only support 768 strands, while CS5 has 1024.

So, that more or less explains the first 2 results, and the last one also makes sense: maybe we can run 16 groups at the same time, but each group will still only run at 25% efficiency. 8x8 gives the same result as 16x16 btw, which makes sense now.
That's cool. Existing CUDA devices have a limit of 8 blocks (groups), which means a group size of only 64 would artificially lower the number of strands in flight. I guess the limit is 16 on ATI (1024/64=16). So a group size of 16x16 is a nice, square, middle ground.

For the 3rd I think the problem is the branch granularity.
16 pixels per strand could make for a large register allocation. The worst case would be only 2 hardware threads in flight on ATI.

If strands are selected in a linear fashion from the 2D thread id, I've got myself a branch granularity of 256x4 pixels(!), and raymarching with distance functions is probably quite bad for the branching (every pixel gets a different number of iterations, and a few get a lot).
Which is why I'm keen to see what kind of performance difference there is between ATI and NVidia - because nested control flow should magnify the difference in incoherence penalties quite dramatically :p

Is it well defined how strands go into threads? I imagine it could even make sense to do some special spatial mapping here, to group pixels of (predicted) similar complexity. (I have been doing a lot of pixel shader raymarching/sphere tracing before, but in a pixel shader there isn't much choice about these things.)
The allocation is linear - the various compute APIs don't expose the hardware's thread size (except for CUDA). In pixel shading it never mattered because pixels could never share data with each other (except the gradient across a quad of pixels for texturing purposes). Once data is shared then you need an addressing scheme that is consistent no matter what the hardware's thread size is (64 per thread, 32 per thread etc.).

Jawed
 
Black screen on the following with this Julia demo:
Windows Vista (6002) Service Pack 2
HIS Radeon 5870

Black screen, and alt-tabbing doesn't help. Can only kill the app with ctrl-alt-del.

I can run the Mandelbrot viewer and the waves demo just fine though.
Also I got the ATI DX11 demos working, and the Heaven bench works with tessellation and all that.

Can you make this start windowed? Maybe that might help?

edit: I can confirm it works in windowed mode. This is a bit weird, but for some reason alt+enter worked twice when I timed it just right... The first time I think I pressed esc to exit... On the second try I pressed the minimize button on the window, which totally killed my system performance - like 10 seconds per frame on the desktop, where the window would very slowly eventually minimize itself... but performance remained such that I could not even ctrl-alt-del anymore and had to hard reset the whole system. Usually alt+enter does nothing for me, and when I tried it for the third time I could not do it again.
 
Can you make this start windowed? Maybe that might help?

Ok, I've uploaded a 1.32 version that starts windowed; I hope this will solve all Win7 issues.

Also, you can now toggle between compute and pixel shader with the P key, to see which one is faster.
 
Ok, I've uploaded a 1.32 version that starts windowed; I hope this will solve all Win7 issues.

Also, you can now toggle between compute and pixel shader with the P key, to see which one is faster.

Works fine on my Win7 box with a 5870. Interestingly enough, the pixel shader version seems to run a fair bit faster - shouldn't the compute shader be faster?
 
Works fine on my Win7 box with a 5870. Interestingly enough, the pixel shader version seems to run a fair bit faster - shouldn't the compute shader be faster?

That's what one would expect. With compute shaders you can try to make it faster by changing the size of the thread group, but I could not make it any faster than it is.
Maybe there is something about the compute shader thread scheduling that is not yet as good as with pixel shaders, but I have no idea what.
 
I wonder if different 2D shapes for the thread group make a difference. Something I forgot about earlier when I was thinking about Z versus linear ordering.

Jawed
 
Ok, I've uploaded a 1.32 version that starts windowed; I hope this will solve all Win7 issues.
Yup, it's working for me now, although "alt-enter" to full screen still wrecks the performance. ;)
My brand new HD5870 is happy now! :D
Also, you can now toggle between compute and pixel shader with the P key, to see which one is faster.
Looks like PS mode is faster for me.
 
I wonder if different 2D shapes for the thread group make a difference. Something I forgot about earlier when I was thinking about Z versus linear ordering.

Jawed

Ah yes, that could explain it: pixel shaders use some kind of space-filling curve, like recursive Z order with high spatial coherence, while the compute shader probably executes a thread group one row after the other.
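A quick experiment would be to remap the group-local index into Z-order inside the compute shader, something like this (standard bit de-interleave; the group size and tile mapping are just illustrative):

// extract the even bits of x, packed into the low bits
uint Compact1By1(uint x)
{
    x &= 0x55555555;
    x = (x ^ (x >> 1)) & 0x33333333;
    x = (x ^ (x >> 2)) & 0x0f0f0f0f;
    x = (x ^ (x >> 4)) & 0x00ff00ff;
    x = (x ^ (x >> 8)) & 0x0000ffff;
    return x;
}

[numthreads(256, 1, 1)]
void CSMain(uint3 gid : SV_GroupID, uint gidx : SV_GroupIndex)
{
    // walk a 16x16 tile in Z-order instead of row by row, so the strands
    // of one hardware thread stay spatially clustered
    uint2 pixel = gid.xy * 16 + uint2(Compact1By1(gidx), Compact1By1(gidx >> 1));
    // ... shade pixel ...
}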
 