Sure, but that's my point: it's precisely because the languages try to "abstract" over SIMD widths. Obviously if you had a purely MIMD machine then lane swizzling *wouldn't* be free, or even make much sense. The reality, though, is that we can probably put a reasonable lower bound on the SIMD widths of these GPUs... 4 is practically guaranteed, and even 16 is probably safe.
LRBni doesn't provide arbitrary lane swizzles, does it? Swizzles are restricted to four-lane neighbourhoods; anything more general requires scatter/gather via a cache line.
As for SIMD width, 16 looks like a safe minimum for a few years yet.
Similar programming model problems arise when trying to implement persistent threads in current APIs.
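For concreteness, the pattern is at least expressible in CUDA, which is what makes the contrast with the graphics compute APIs so stark. A minimal sketch (all names here are mine, purely illustrative): launch only as many blocks as the GPU can keep resident, then let each block drain a global work queue via atomics.

[code]
#include <cuda_runtime.h>

__device__ unsigned int g_next = 0;   // global work-queue cursor (reset between launches)

__global__ void persistentKernel(const float* in, float* out, unsigned int count)
{
    __shared__ unsigned int base;
    for (;;) {
        if (threadIdx.x == 0)
            base = atomicAdd(&g_next, blockDim.x);  // one atomic per block per chunk
        __syncthreads();
        if (base >= count)                          // uniform across the block
            break;                                  // queue drained
        unsigned int i = base + threadIdx.x;
        if (i < count)
            out[i] = in[i] * 2.0f;                  // placeholder per-item work
        __syncthreads();                            // before base is overwritten
    }
}
// Launch with gridDim ~= multiProcessorCount * blocksPerSM rather than one
// block per work item -- that residency assumption is exactly what the
// current graphics APIs give you no way to express.
[/code]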
I dare say that seems likely to be an explicit feature of D3D12. Fingers crossed.
Debatable... we're already seeing legacy problems with code compiled against static shared memory sizes. Similarly, the most efficient code chooses block sizes tied to the number of cores on the GPU. These are significant problems going forward.
I was referring to cruft in the hardware implementation.
Cruft in the software is an ongoing problem. The size of shared memory is just one variable out of many that result in "over-optimisation" for today's hardware. Cache-line size, L1$ size, SIMD width, register file size, DDR burst length and DDR banking are some others.
NVidia, with CUDA, has attempted to obfuscate some parameters to prevent "over-optimisation". Guaranteeing a warp size of 32 for the foreseeable future is good, though shared memory in Fermi (or at least GF100) has new bank conflict issues.
Even when hardware parameters are obfuscated, developers are liable to dig, resulting in potential "over-optimisation".
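On the software side, one partial mitigation is to query the parameters that are exposed rather than compiling them in - a trivial CUDA sketch (the choice of fields is mine):

[code]
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("SIMD (warp) width:       %d\n",  prop.warpSize);
    printf("Shared memory per block: %zu\n", prop.sharedMemPerBlock);
    printf("Registers per block:     %d\n",  prop.regsPerBlock);
    printf("Multiprocessors:         %d\n",  prop.multiProcessorCount);
    // Sizing blocks and shared memory from these at run time at least
    // survives the numbers changing underneath the code.
    return 0;
}
[/code]

Of course that only covers what's exposed, and it doesn't stop anyone tuning to the answers.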
Glad you like it
To be clear, I'm not saying that the GPU models are a failure or anything - obviously this stuff would only be possible with them! I'm more commenting on the things that keep coming up and appear problematic moving forward... i.e. the areas where I think we should be focusing our innovation and research efforts. I definitely do not think the current GPU computing models are near an end state, and I imagine most people would agree with that.
I dare say game developers are mostly still trying to catch up with D3D10(.1) and so their input on what's needed for D3D12 is limited. Obviously there are people at the cutting edge and maybe it's best that there's only a few of you stirring the pot.
I sure hope so! In my presentation I unsubtly hinted to the audience to find me a faster implementation, so maybe with all those smart heads there and a touch of motivation/competition we'll get something.
I still haven't looked at your code, but here are three generic ideas (a rough sketch pulling them together follows the list):
- hash by SIMD lane - this is similar to the SR (globally shared register) technique that I referred to before on ATI.
e.g. with a maximum of 2048 bins and shared memory capacity of 32KB, you can hash by a factor of 16. The obvious key is (absolute work item ID & 15), so you can use (ZCoarse << 4) + (WorkItemID & 15) as the counter address, making atomics from neighbouring work items collision-free.
Obviously with more bins you'd hash by a smaller factor and so would suffer some collisions.
- tile by work item - generally you should be fetching multiple samples per work item in a coherent tile, between atomics.
There are two benefits here. First, the hardware prefers to fetch coherently (e.g. gather4), and ATI likes to fetch wodges (i.e. 16x gather4) and has the register file capacity to do so. Second, by tiling like this you automatically serialise Z buckets that, being neighbours in 2D, are likely to collide.
e.g. each work item fetches an 8x8 tile.
- scatter work items - to improve the serialisation of ZCoarse (if 1 and 2 don't eliminate collisions), you can make each work item sample discontiguously, reducing the chances of collisions amongst neighbouring work items.
e.g. if you are doing 8x8 tiles per work item from technique 2, then work item 1 is offset from work item 0 along the diagonal, starting at (8,8), work item 2 at (16,16), etc.
Undoubtedly different cards will want different tunings for techniques 2 and 3, and the two need to be balanced against each other: 2 improves cache coherency while 3 spoils it.
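Pulling the three together, a rough CUDA sketch - purely illustrative, since the bin count, tile size and the zCoarse() binning function here are my own placeholders, not the demo's actual code:

[code]
#define COARSE_BINS 128  // assumed bin count: 128 * 16 counters = 8KB shared
#define HASH        16   // technique 1: 16 lane-hashed copies of each bin
#define TILE        8    // technique 2: each work item walks an 8x8 tile

// Placeholder binning; the real function depends on the demo's partitioning.
__device__ int zCoarse(float z)
{
    return min((int)(z * COARSE_BINS), COARSE_BINS - 1);
}

__global__ void tiledHistogram(const float* depth, int width, int height,
                               unsigned int* histogram)
{
    __shared__ unsigned int bins[COARSE_BINS * HASH];
    for (int i = threadIdx.x; i < COARSE_BINS * HASH; i += blockDim.x)
        bins[i] = 0;
    __syncthreads();

    int tilesX = width / TILE, tilesY = height / TILE;
    int t = blockIdx.x * blockDim.x + threadIdx.x;   // one tile per work item
    if (t < tilesX * tilesY) {
        // Technique 3: skew tiles along the diagonal so neighbouring work
        // items sample discontiguous regions (a permutation within each row).
        int tileY = t / tilesX;
        int tileX = (t + tileY) % tilesX;
        int x0 = tileX * TILE, y0 = tileY * TILE;

        // Technique 2: coherent 8x8 fetches per work item, between atomics.
        for (int dy = 0; dy < TILE; ++dy)
            for (int dx = 0; dx < TILE; ++dx) {
                int bin = zCoarse(depth[(y0 + dy) * width + (x0 + dx)]);
                // Technique 1: (ZCoarse << 4) + (lane & 15), so
                // neighbouring lanes never hit the same counter.
                atomicAdd(&bins[(bin << 4) + (threadIdx.x & (HASH - 1))], 1u);
            }
    }
    __syncthreads();

    // Fold the 16 hashed copies back into the global histogram.
    for (int b = threadIdx.x; b < COARSE_BINS; b += blockDim.x) {
        unsigned int sum = 0;
        for (int h = 0; h < HASH; ++h)
            sum += bins[(b << 4) + h];
        atomicAdd(&histogram[b], sum);
    }
}
[/code]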
Not really doing any other histogram-related stuff right now... I didn't mean to imply that I was. I've been doing mostly deferred shading stuff lately (as per the other demo), which brings out its own interesting hardware and software puzzles.
Check my presentation and demo above for the initial batch, although there's nothing quite as drastic as the 4x perf difference in the SDSM histogram path (though frankly I do find the MSAA scheduling results pretty interesting in terms of programming models and hardware scheduling moving forward).
I've only had a quick look at that...