DX11 Julia 4D

I wonder if different 2D shapes for the thread group make a difference. Something I forgot about earlier when I was thinking about Z versus linear ordering.

Yeah... I also tried that when verifying the branch granularity suspicion:
Pixel shader version: 88 fps
16*16 group size (normal version): 72
8*32 group size: 96 :D
and for the fun of it 64*4: 50

I.e. the normal version has a 16*4 pixel branch granularity, while the modified one has a more optimal square 8x8.
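
To make the mapping concrete, here's a minimal sketch (shader and helper names are made up, and it assumes strands are packed row-major within the group) of how the group shape sets the pixel footprint of a 64-strand hardware thread:

RWTexture2D<float4> outputTex : register(u0);

// Placeholder for the real per-pixel sphere-tracing work.
float4 shadePixel(uint2 p)
{
    return float4(p.x & 7, p.y & 7, 0, 1) / 8.0;
}

// 16x16 groups: with row-major strand packing, strands 0..63 of a hardware
// thread cover a 16x4 pixel strip, so one divergent pixel stalls a 16x4 area.
// 8x32 groups: strands 0..63 cover an 8x8 square instead, which fits the
// spatial coherence of the fractal better.
[numthreads(8, 32, 1)]
void JuliaCS(uint3 dtid : SV_DispatchThreadID)
{
    outputTex[dtid.xy] = shadePixel(dtid.xy);
}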
 

Indeed, the 8*32 group makes it slightly faster than the pixel shader; I'll put your find in the next version. ;)

This manual group size tuning leaves much to be desired in my opinion; on another GPU another group size may be optimal. There should at least be an 'auto' group size mode, where the driver automatically figures out what works best.

Edit: On closer look, depending on the orientation either the compute or the pixel shader becomes faster, but they are never far apart.
 
It all works now; even if I Alt+Enter to full screen it still works.

Oh and may I just add: there are a total of 7 DirectX 11 applications that I know of. There's BattleForge that does a little bit of SSAO, then there are the two AMD tech demos, then there's the Heaven benchmark... and then there are these 3 compute shader thingies of yours...

What I'm saying is: thanks! There isn't a whole lot of DX11 stuff to test as of now, so your doing these interesting things is helpful, to say the least :)
 
A thread group of 8x8 should also work.

It'd be interesting to see if 32x8 is faster or slower than 8x32.

Manual thread group sizing is only the start (you could try creating an empirical auto-sizer). Welcome to the woes of compute on GPUs. Your more advanced applications will also have to worry about coherency in shared memory accesses and coalescing of global memory accesses (if the hardware does it at all).
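
To illustrate the coalescing point (buffer names and the width constant here are invented, and whether it matters at all depends on the hardware): consecutive strands should touch consecutive addresses.

StructuredBuffer<float>   inputBuf  : register(t0);
RWStructuredBuffer<float> outputBuf : register(u0);

#define WIDTH 1024   // assumed image width for the linear addressing

[numthreads(64, 1, 1)]
void CopyCS(uint3 dtid : SV_DispatchThreadID)
{
    // Coalesced: strands 0..63 of a hardware thread read and write
    // 64 consecutive elements.
    uint idx = dtid.y * WIDTH + dtid.x;
    outputBuf[idx] = inputBuf[idx];

    // A column-major (strided) pattern like
    //   uint idx = dtid.x * WIDTH + dtid.y;
    // scatters the 64 accesses across memory and typically hurts.
}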

Jawed
 
Here are some more frames per second for different group sizes:

8x32 48
8x24 45
8x16 40

8x8 25
32x8 24

Some more fiddling with narrower groups makes it even faster than Psycho's 8x32:

4x32 51
4x64 59

2x64 60
2x128 61

For reference, the pixel shader is at 49.
 
I dunno whether to laugh or cry :???: 2-wide is a real shocker.

Is the shape of the thread group interacting with the data (some view-dependency has been reported already)? The shape interacts with control-flow incoherency penalties, as reported by EduardoS for Mandelbrot:

http://forum.beyond3d.com/showthread.php?p=1351048#post1351048

I dare say it's worth testing 2x256. And, erm, I suppose it's worth checking that 1x64, 1x128, 1x256 and 1x512 are all crap.

Also, Psycho's experimenting with multiple pixels per thread (strand), which adds a couple of dimensions to the problem of identifying the best shapes to use :p

Jawed
 
Jawed, once in a while you should get yourself some up to date gear to do some practical experiments.

The rendering rate does not seem to depend too much on the 3D orientation of the fractal.

Here is some more data:
2x256 40

1x64 47
1x128 57
1x256 51
1x512 28

Try to understand that :oops:
 
Jawed, once in a while you should get yourself some up to date gear to do some practical experiments.
:LOL: I'm feeling the pain of my PC, components of which are 5 years old (that's excluding the sound card, which goes way back) and I'm scrimping towards putting together a whole new system.

Try to understand that :oops:
Well, I suppose it's a relief that 1x?? isn't fastest of all, but the fact that 1x128 is faster than most other options is pretty intriguing.

I wonder if driver immaturity is causing some of this?

The register allocation of the PS version is 18 vec4s, which, assuming the CS version is similar, means the hardware can't run a full 16 hardware threads on each SIMD.

This might imply that group size 512 is struggling with one group of 8 hardware threads but another group of only 6 (assuming 14 threads in total can be in flight, presumably from the ~256 registers available per strand divided by the 18 used). I don't know how the hardware actually schedules that second group.

With group size 256, the hardware is trying to support 3 groups of 4 hardware threads and 1 group of 2. Again, that seems problematic.

With group size 128, the hardware is trying to support 7 groups of 2 hardware threads. While that might explain 1x128 being the best performer of the 1-wide set, 2x128 is the best performer - and that has problematic grouping due to group size of 256.

Newer versions of DirectCompute will increase the maximum size of a workgroup, increasing the choices. Coupled with changing architectures and competing cards as time goes by, it's quite a problem.

---

Looking at the code I notice that not all literals are floats, and I'm wondering if some are treated as double by the compiler. Can the pixel shader use doubles? I'm doubtful, but I really don't know. I'm wondering if there's a chance the compiler is using doubles in the CS version and making it slower than it needs to be.

The reason I raise this is simply that mucking about with Brook+ I've learnt to be really careful about not accidentally using doubles. e.g. intersectSphere is 13 cycles as written and 7 cycles using float literals (that's isolating the function by testing it as Brook+).

Jawed
 
I've updated to version 1.4.

Now compute shader is slightly faster than pixel shader by using a 4x64 thread group.
Mouse interaction is improved by using the common virtual trackball method (rough sketch below).
Zooming in and out is now possible with the mouse wheel.
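
For reference, here's a rough sketch of the usual virtual trackball mapping, written in HLSL-style syntax just for consistency (the real code lives on the CPU side, and the function names are made up):

// Map a mouse position (in [-1,1] window coordinates) onto a unit sphere.
float3 TrackballPoint(float2 m)
{
    float d2 = dot(m, m);
    if (d2 <= 1.0)
        return float3(m, sqrt(1.0 - d2));   // on the sphere
    return float3(normalize(m), 0.0);        // outside: clamp to the rim
}

// Rotation between two mouse positions: axis and angle.
void TrackballRotation(float2 m0, float2 m1, out float3 axis, out float angle)
{
    float3 p0 = TrackballPoint(m0);
    float3 p1 = TrackballPoint(m1);
    axis  = normalize(cross(p0, p1));
    angle = acos(clamp(dot(p0, p1), -1.0, 1.0));
}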
 
Yeah, CS is now much faster than PS for me on my 5870.
Is there a way to implement some sort of AA sampling to the generated frames? Pixel aliasing on the fine structures is quite pronounced. ;)
 
The register allocation of the PS version is 18 vec4s, which, assuming the CS version is similar, means the hardware can't run a full 16 hardware threads on each SIMD.

But our theory was that it only schedules full groups, so for instance at size 256 we will have just 3 groups running at the same time. As we have no memory access (except the final write) all we need to utilize the hardware are 2 interleaved hardware threads running. And for that 2 groups should be enough (in theory at least). And at least this shouldn't be affected by the 2D layout, only the total number.

The performance of those narrow group layouts is really puzzling; branch coherence should definitely be best in 8x8 pixel blocks, so how can 2x128 ever be better than 8x32...?
Maybe it's time to check if the 2D index -> hardware thread index mapping is as expected..
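
One way to check it (everything here is a made-up diagnostic, not the actual Julia 4D code): colour each pixel by which 64-strand hardware thread its flattened group index would fall into, and see if the blocks on screen match the assumption.

RWTexture2D<float4> outputTex : register(u0);

[numthreads(8, 32, 1)]
void DebugMappingCS(uint3 dtid : SV_DispatchThreadID,
                    uint  gidx : SV_GroupIndex)
{
    // SV_GroupIndex is the flattened thread index within the group.
    // If strands are packed into hardware threads in this order,
    // pixels with the same (gidx / 64) share a hardware thread.
    uint wave = gidx / 64;
    float shade = (wave & 1) ? 1.0 : 0.25;   // checker the wavefronts
    outputTex[dtid.xy] = float4(shade, shade, shade, 1);
}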

Is there a way to implement some sort of AA sampling to the generated frames? Pixel aliasing on the fine structures is quite pronounced.

We could always add supersampling, but that would be pretty costly. For branch granularity it would be best to have just 1 sample/thread, and then maybe add them up in LDS. A more adaptive approach is problematic because of the branching. Sphere tracing is already a lot more costly on the edges (which is where we need the additional samples), so adding more samples to the already slowest strands would be quite ineffective. Could be interesting to try append buffers to schedule additional work for later.
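
A minimal sketch of that 1-sample-per-strand idea, just to show the shape of it (the sample count, jitter table and TraceSample stub are all placeholders): a few strands per pixel each trace one jittered sample, the results go into LDS, and one strand averages and writes them.

RWTexture2D<float4> outputTex : register(u0);

#define SAMPLES 4
groupshared float4 ldsSamples[8 * 8 * SAMPLES];

// Placeholder for the real sphere tracer (pixel position plus sub-pixel jitter).
float4 TraceSample(float2 p)
{
    return float4(frac(p * 0.01), 0, 1);
}

// 8x8 pixels per group, SAMPLES strands (z dimension) per pixel.
[numthreads(8, 8, SAMPLES)]
void JuliaAACS(uint3 gtid : SV_GroupThreadID, uint3 gid : SV_GroupID)
{
    static const float2 jitter[SAMPLES] =
    { float2(0.25, 0.25), float2(0.75, 0.25), float2(0.25, 0.75), float2(0.75, 0.75) };

    uint2 pixel = gid.xy * uint2(8, 8) + gtid.xy;
    uint  slot  = (gtid.y * 8 + gtid.x) * SAMPLES + gtid.z;

    ldsSamples[slot] = TraceSample(pixel + jitter[gtid.z]);
    GroupMemoryBarrierWithGroupSync();

    // One strand per pixel averages its SAMPLES entries and writes the result.
    if (gtid.z == 0)
    {
        float4 sum = 0;
        for (uint i = 0; i < SAMPLES; ++i)
            sum += ldsSamples[(gtid.y * 8 + gtid.x) * SAMPLES + i];
        outputTex[pixel] = sum / SAMPLES;
    }
}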

Got around to writing new pixel-work-sharing versions, after the initial 16 pix/thread idea failed on branch coherency.
First, a somewhat complicated single-pass scheme that *should* have decent performance: each group is 8*32+16 threads (no 2D index here), where the initial hard work (calculating a lower bound on depth for each 4x4 pixel block) is done in the last 16 threads (8*32 pixels = 16 4x4 pixel blocks). After this we synchronize through LDS with the remaining 256 threads so they can perform lighter work than usual. This performs slightly worse than the plain version, even though less work is done.
Of course the initial hardware thread is only running 16 strands, but that's only a smaller part of the workload (we have 4 more hardware threads in each group). And as I want at least 2 groups running, I can't go for a larger group size. Still, compared to the version below it comes out worse than expected.
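
Purely to illustrate the shape of that handshake (this is not the actual code; the trace functions are stubs and the indexing is just one plausible layout): the last 16 strands write per-4x4-block depth bounds into LDS, everyone syncs, and the 256 pixel strands start from the shared bound.

RWTexture2D<float4> outputTex : register(u0);
groupshared float blockMinDepth[16];   // one lower bound per 4x4 pixel block

// Stand-ins for the real sphere-tracing code.
float  TraceBlockLowerBound(uint2 blockOrigin)        { return 0.0; }
float4 TraceFromDepth(uint2 pixel, float startDepth)  { return float4(startDepth, 0, 0, 1); }

// 256 pixel strands plus 16 helper strands per group (no 2D index here).
[numthreads(272, 1, 1)]
void JuliaSharedCS(uint3 gid : SV_GroupID, uint gidx : SV_GroupIndex)
{
    // The group covers an 8x32 pixel tile; derive its origin from the group id.
    uint2 tileOrigin = uint2(gid.x * 8, gid.y * 32);

    if (gidx >= 256)
    {
        // Last hardware thread: conservative lower-bound trace per 4x4 block.
        uint  block       = gidx - 256;   // 0..15
        uint2 blockOrigin = tileOrigin + uint2((block % 2) * 4, (block / 2) * 4);
        blockMinDepth[block] = TraceBlockLowerBound(blockOrigin);
    }
    GroupMemoryBarrierWithGroupSync();

    if (gidx < 256)
    {
        uint2 local = uint2(gidx % 8, gidx / 8);        // 8x32 tile, row-major
        uint  block = (local.y / 4) * 2 + (local.x / 4);
        uint2 pixel = tileOrigin + local;
        outputTex[pixel] = TraceFromDepth(pixel, blockMinDepth[block]);
    }
}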

So I had to write a regular 2-pass version instead, which could just as well be done in pixel shaders (ran into a lot of stupid restrictions with the new buffer types, btw).
An initial pass writes out a lower bound on depth for each 4x4 block to a quarter-sized structured buffer, and a final pass starts at those positions instead of at the camera. This gives in the range of 20-40% more performance for this scene. Again, looking at the number of distance evaluations (just plotting them per pixel), it looks like my FLOPS rate is going down. The first pass takes <10% of the time.
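
A hypothetical sketch of what that first pass could look like (buffer layout, names and the width constant are invented): one strand per 4x4 block traces a conservative lower bound and writes it to a structured buffer, which the second pass reads as its ray start.

// Pass 1: one strand per 4x4 pixel block writes a conservative lower bound
// on depth into a structured buffer (one entry per block).
RWStructuredBuffer<float> blockDepth : register(u0);

#define BLOCKS_X 128   // assumed: image width / 4

// Stand-in for a sphere trace with an enlarged ray footprint.
float TraceBlockLowerBound(uint2 blockOrigin) { return 0.0; }

[numthreads(8, 8, 1)]
void DepthBoundCS(uint3 dtid : SV_DispatchThreadID)
{
    uint2 blockOrigin = dtid.xy * 4;
    blockDepth[dtid.y * BLOCKS_X + dtid.x] = TraceBlockLowerBound(blockOrigin);
}

// Pass 2 (the ordinary per-pixel shader) then starts its march at
// blockDepth[(pixel.y / 4) * BLOCKS_X + pixel.x / 4] instead of at the camera.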

btw, "a= (b>0)?0:complex()" is always doing BOTH branches, while "if (b>0) a=0; else a=complex();" is not..
 
That branch granularity thing is tricky..
I'm rendering simple blocks, with just a few very slow pixels (strands); performance for 512x512 is about 50 fps if all pixels are slow and ~7500 fps if all are fast.

So, an 8x32 group size with an 8x8 pixel grid (i.e. a slow pixel in the corner of each 8x8 block) is ~50 fps as expected.
16x16 group, 8x8 pixel grid: 100 fps (sure: every 2nd hardware thread is running fast, while every other one has 2 slow pixels)

But:
4x64 group, 4x16 grid: 50
4x64 group, 8x8 grid: 50
4x64 group, 8x16 grid: 50
4x64 group, 16x4 grid: 100

2x128 group, 4x16 grid: 50
2x128 group, 8x8 grid: 100

Hmm.. are two groups executing interleaved, dependent threads? Let's fill up a whole SIMD with one group:
8x128 group, 8x8 grid: 50
8x128 group, 8x16 grid: 100
8x128 group, 16x8 grid: 50
Still we have this horizontal dependency!?

At least it's somehow in line with the 4x64 optimal size for Julia 4D: if this really means an 8x16 branch granularity, it's of course about as good as the 8x32 group's 16x8.
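
For reference, the probe kernel described above might look something like this (group shape, grid constants and the iteration count are arbitrary): a pixel is 'slow' only when it sits on the chosen grid, and the resulting fps tells you which pixels ended up sharing a hardware thread.

RWTexture2D<float4> outputTex : register(u0);

#define GRID_X 8   // "slow pixel" grid being tested, e.g. 8x8
#define GRID_Y 8

[numthreads(4, 64, 1)]   // group shape being tested, e.g. 4x64
void GranularityProbeCS(uint3 dtid : SV_DispatchThreadID)
{
    float v = 0.0;
    // One slow pixel in the corner of each GRID_X x GRID_Y block.
    if ((dtid.x % GRID_X) == 0 && (dtid.y % GRID_Y) == 0)
    {
        // Deliberately expensive loop; everything sharing a hardware
        // thread with this pixel has to wait for it.
        [loop]
        for (uint i = 0; i < 100000; ++i)
            v = frac(v * 1.0001 + 0.123);
    }
    outputTex[dtid.xy] = float4(v, v, v, 1);
}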
 
But our theory was that it only schedules full groups, so for instance at size 256 we will have just 3 groups running at the same time. As we have no memory access (except the final write) all we need to utilize the hardware are 2 interleaved hardware threads running. And for that 2 groups should be enough (in theory at least). And at least this shouldn't be affected by the 2D layout, only the total number.
Agreed, the theory is that only complete groups can be in flight.

On ATI all clause switches incur a latency penalty. So a pure ALU kernel with only 1 clause has no clause switch latency to hide. But each additional clause adds some unknown number of extra cycles. With more than 2 hardware threads this latency can be hidden. I'm unclear if 3 is enough or 4 are needed.

Apart from the latency induced by the write at the end, there'll be a number of additional clauses corresponding with the control flow. Each branch incurs latency for two reasons: a clause switch is performed and the sequencer has to evaluate the predicate and determine where to pass control flow. I don't know if these latencies are additive - I expect so, as I presume the clause switch latency has a component of register fetch latency.

I don't know how many threads, minimum, are required to hide control flow latency. e.g. it might be 6.

Jawed
 
Hmm.. are two groups executing interleaved, dependent threads? Let's fill up a whole SIMD with one group:
8x128 group, 8x8 grid: 50
8x128 group, 8x16 grid: 100
8x128 group, 16x8 grid: 50
Hmm, this is interesting. Presuming that the register allocation is nominally too high for 16 hardware threads, I think this might imply that register spill is being used to allow the entire 1024 strands to be in flight. Or, perhaps the compiler/driver sees that LDS is not being used, so relaxes the constraint on launching the entire group atomically.

Jawed
 
btw, "a= (b>0)?0:complex()" is always doing BOTH branches, while "if (b>0) a=0; else a=complex();" is not..
Not really; in fact, the spec states complex() should only be called if !(b>0). If neither path has side effects, the compiler may execute both to avoid a branch or to use a conditional move.
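
For what it's worth, HLSL's [branch] and [flatten] attributes let you make the intent explicit either way (whether the compiler and driver honour them is another matter):

// Ask for a real branch: complex() only evaluated when the condition fails.
[branch]
if (b > 0) a = 0; else a = complex();

// Ask for predication: both sides evaluated and the result selected afterwards,
// which is what the ?: form often compiles to when there are no side effects.
[flatten]
if (b > 0) a = 0; else a = complex();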
 