Jawed
> Sigh I totally forgot about shuffle - I remember being excited about that, once

Swizzles to the quads are free, but you can shuffle the 128-bit blocks using an instruction. That's all you need for horizontal reductions/address comparisons. Furthermore, even if it did not, any swizzle neighbourhood (even 4) is better than none.
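(Aside: the shuffle-based horizontal reduction being discussed can be sketched in Python. The 16-lane width and the xor/butterfly shuffle pattern below are illustrative, not any particular ISA.)

```python
# Simulate a butterfly (xor-shuffle) horizontal reduction across SIMD lanes.
# At step k, every lane fetches the value held by the lane whose index
# differs in bit k, then adds it. After log2(N) steps every lane holds the
# full sum -- no shared-memory round trip needed.

def shuffle_xor(lanes, mask):
    # Stand-in for a lane shuffle: lane i fetches from lane i ^ mask.
    return [lanes[i ^ mask] for i in range(len(lanes))]

def horizontal_sum(lanes):
    n = len(lanes)
    assert n & (n - 1) == 0, "lane count must be a power of two"
    mask = 1
    while mask < n:
        fetched = shuffle_xor(lanes, mask)
        lanes = [a + b for a, b in zip(lanes, fetched)]
        mask <<= 1
    return lanes  # every lane now holds the total

print(horizontal_sum(list(range(16)))[0])  # -> 120
```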
> Well both GPUs support concurrent compute kernels now (still pretty fuzzy on what that really means on ATI, though) so if it isn't in 12, well there's always LRB...

I'm not as convinced that there's quite enough buy-in on this yet, but maybe OpenCL's first pass at a task system will motivate some innovation in that space.
> I agree generally.

Sure, although it's worth noting that increasing cache sizes behaves better with legacy code than scratch pad memory (which just goes unused).
> The learning curve is still too steep even for the hardware designers (local atomics and tessellation being great examples, currently), so I think this needs repeated revisiting. (Then there's the physics of what's buildable, which I think is why tessellation is poor in ATI currently - not sure what the deal is with local atomics in NVidia.)

I can buy the "over-optimization" argument on CPUs, where you're talking about single-digit % increases in a lot of cases (sometimes more, but on balance), but on GPUs you're often talking about an order of magnitude... that's too much performance to leave on the floor for "portability".
Some of Intel's Terascale work looks more like Transputer than Larrabee (maybe that's me in wishful thinking mode) and there's quite a vocal contingent who think the cache architecture of Larrabee isn't viable in the long term, where we're talking hundreds and thousands of cores.
So if we're really going to talk about programming models that can last longer than 10 years, then at best one can only hope for "local cache", whatever that actually means when programming 1000 cores.
> Well my posting this morning was the summation of my thoughts as I fell asleep last night before having a proper look at your code.

Right, but the bins aren't just sums - they are 7 32-bit values each!
> You reported about 1.4ms to compute the histogram on ATI, as I understand it (I've no idea what kind of overhead that has...). That's about 380 million cycles. For about 2 million samples? (Or is this 4xMSAA, 8 million samples?)

There's not enough shared memory to amplify them even 2x at the moment. You might be able to pack a few things into half-floats and get a 2x spread but you're definitely not going to get to one per SIMD lane!
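(The half-float packing idea can be sketched with Python's `struct` module standing in for GPU half precision. Purely illustrative - and note half floats only represent integers exactly up to 2048, which limits how far you can count in them.)

```python
import struct

# Pack two 16-bit half-floats into one 32-bit word, doubling how many bin
# copies fit in the same amount of shared memory. '<2e' is two little-endian
# half-precision floats; '<I' reinterprets the same 4 bytes as a uint32.

def pack_halves(a, b):
    return struct.unpack('<I', struct.pack('<2e', a, b))[0]

def unpack_halves(word):
    return struct.unpack('<2e', struct.pack('<I', word))

word = pack_halves(1.5, 2.5)
print(unpack_halves(word))  # -> (1.5, 2.5)
```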
> So around 200 cycles (or 50 for 4xMSAA?) per sample?
The inner loop is 31 cycles according to GPU Shader Analyzer.
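(For reference, the arithmetic behind the two per-sample estimates above, using only the figures quoted in the thread:)

```python
# ~380 million cycles for the whole histogram pass, over ~2 million samples
# at 1x, or ~8 million samples with 4xMSAA.
total_cycles = 380e6
samples_1x = 2e6
samples_4x = 8e6

print(total_cycles / samples_1x)  # 190 cycles/sample -> "around 200"
print(total_cycles / samples_4x)  # 47.5 cycles/sample -> "or 50"
```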
> Yeah, I was assuming a "persistent" kernel.

Yes, mostly already done. Tile sizes are decoupled from the compute domain and chosen so that there are ~as many tiles as required to just fill the GPU. This is important to minimize global writes/atomics traffic at the end of each work group.
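(A sketch of that tile-sizing scheme in Python; the core and occupancy counts are made-up placeholders, not any real GPU's numbers.)

```python
import math

# Decouple tile size from the compute domain: launch just enough persistent
# work groups to fill the GPU, so each group folds its partial histogram
# into global memory only once at the end.

def tile_layout(num_pixels, num_cores=10, groups_per_core=4):
    num_tiles = num_cores * groups_per_core        # ~enough groups to fill the GPU
    tile_size = math.ceil(num_pixels / num_tiles)  # pixels each group loops over
    return num_tiles, tile_size

tiles, size = tile_layout(1280 * 1024)
print(tiles, size)  # -> 40 tiles of 32768 pixels each
```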
> As I haven't done this stuff for real I don't understand the issue here. But I have a feeling I'm seeing coherent fetches where there aren't any.

I played with strided vs "linear" lookups across the thread group but the latter were generally faster. If NVIDIA's coalescing logic remains the same then the latter will definitely be faster. I haven't played with using gather4 explicitly though... it's quite an annoying programming model - if they want the access like that, they have free rein to reorganize the work items in the group.
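(A toy model of why the "linear" pattern coalesces better: count how many fixed-size memory segments one batch of lookups touches. The group size, segment size and per-thread stride below are illustrative.)

```python
GROUP = 64    # threads per group (illustrative)
SEGMENT = 16  # addresses served by one coalesced transaction (illustrative)

def addresses(iteration, strided):
    if strided:
        # strided: each thread walks its own contiguous 4-element chunk
        return [t * 4 + iteration for t in range(GROUP)]
    # linear: consecutive threads touch consecutive addresses each iteration
    return [iteration * GROUP + t for t in range(GROUP)]

def transactions(addrs):
    # number of distinct SEGMENT-sized blocks touched by the batch
    return len({a // SEGMENT for a in addrs})

print(transactions(addresses(0, strided=False)))  # 4 segments for 64 threads
print(transactions(addresses(0, strided=True)))   # 16 segments -- 4x the traffic
```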
> I made that point partly to defend against the cache thrashing that technique 3 engenders. On ATI clause switches cost significant cycles, so a single clause of 16 TEX is preferable to 16 clauses of 1 TEX - the latency of the latter and the increase in cache thrashing it directly causes would hurt 3.

The lookup from the Z-buffer is *definitely* not the bottleneck though, so I'm hesitant to try and optimize this much more.
> Does the entire histogram need to be generated each frame?

Yup, I played with more random sampling to reduce collisions and it does work, but with one huge problem: you can't have a massive performance falloff in the case where EVERY pixel on the screen collides. Put another way, worst-case performance is what matters, not making the easier cases faster. This is a key point for game developers and one that I've heard often. Thus I haven't put a lot of effort into making the fast cases faster.
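(A toy model of the worst-case-collision point, assuming colliding atomic updates within a SIMD batch serialize; the lane count and cost model are assumptions, not measured behaviour.)

```python
from collections import Counter

# If atomic updates to the same bin serialize, the cost of one SIMD batch of
# histogram updates scales with the largest number of lanes hitting a single
# bin. Lane count is illustrative.
LANES = 64

def batch_cost(bin_indices):
    # serialized updates = max multiplicity of any one bin in the batch
    return max(Counter(bin_indices).values())

spread = list(range(LANES))  # every lane hits a different bin
collide = [7] * LANES        # every lane hits the same bin

print(batch_cost(spread))   # 1  -- fully parallel
print(batch_cost(collide))  # 64 -- fully serialized: the worst case that
                            #       has to stay fast
```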
> As the camera translates the histogram is, in general, shifting coherently, isn't it?