f32tof16 + f16tof32 could of course be used inside the shader to store packed data manually in integer registers. You'd pay a few extra instructions per light source to convert the accumulators back to fp32, do the fp32 math, and then convert back to fp16. This would give you some GPR gains but would also increase the ALU count (instead of decreasing it). On modern GPUs you are seldom purely register, ALU, or bandwidth bound. This kind of trick might be beneficial in this particular case (light accumulation variables), but if you need the fp16 variable more often, the back-and-forth conversion costs become a problem (without native fp16 ALUs you need to convert a lot more).
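A minimal sketch of what that packed accumulator could look like in HLSL; the loop structure and LightContribution are hypothetical stand-ins:

```hlsl
// Sketch: keep the RGB light accumulator packed as fp16 halves in two
// integer registers instead of three fp32 registers.
uint accRG = 0; // R in the low 16 bits, G in the high 16 bits
uint accB  = 0; // B in the low 16 bits

for (uint i = 0; i < lightCount; ++i)
{
    float3 c = LightContribution(i); // hypothetical per-light evaluation

    // Unpack to fp32, accumulate, repack to fp16. This is the extra
    // per-light ALU cost mentioned above.
    float r = f16tof32(accRG)       + c.r;
    float g = f16tof32(accRG >> 16) + c.g;
    float b = f16tof32(accB)        + c.b;

    accRG = f32tof16(r) | (f32tof16(g) << 16);
    accB  = f32tof16(b);
}

float3 result = float3(f16tof32(accRG), f16tof32(accRG >> 16), f16tof32(accB));
```

Three fp32 accumulators shrink to two integer registers, at the cost of a handful of conversion and shift/or instructions per light.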
We have used these kinds of tricks to conserve LDS space, but not to conserve registers. Packing two 16-bit integers together is also quite nice on GCN, since it has (full rate) combined mask+shift instructions.
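For the LDS case, the packing itself is just a mask and a shift; the groupshared layout here is a made-up example:

```hlsl
// Sketch: two 16-bit values per 32-bit LDS slot, halving LDS usage.
groupshared uint packedDepth[128]; // 256 logical 16-bit values

void StoreDepthPair(uint slot, uint d0, uint d1)
{
    // Mask + shift: the combined op GCN can execute at full rate.
    packedDepth[slot] = (d0 & 0xFFFF) | (d1 << 16);
}

uint2 LoadDepthPair(uint slot)
{
    uint v = packedDepth[slot];
    return uint2(v & 0xFFFF, v >> 16);
}
```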
In this case I'm thinking of packing three 10-bit integers per register to save four registers overall. But anyway, if your register allocation is at 110, then saving four registers isn't going to magically get you down to 84, which is what you need to gain a whole extra hardware thread on the ALU.
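The 10-bit variant could look something like this (the field layout is illustrative, not from the original post):

```hlsl
// Sketch: three 10-bit integers in one 32-bit register instead of three.
uint Pack10(uint3 v)
{
    return (v.x & 0x3FF) | ((v.y & 0x3FF) << 10) | ((v.z & 0x3FF) << 20);
}

uint3 Unpack10(uint p)
{
    return uint3(p & 0x3FF, (p >> 10) & 0x3FF, (p >> 20) & 0x3FF);
}
```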
The X360 shader compiler had an [isolate] tag to force it to compile blocks in a vacuum (forcing register lifetimes to stay inside the block). You could also use it on variables to force the compiler to recalculate them (or reload from constant memory / L1). I kind of miss this.
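From memory, usage looked roughly like this; treat the exact block-attribute syntax as an assumption, and the fog code is purely illustrative:

```hlsl
// Sketch: the attribute told the X360 compiler to schedule this block in
// isolation, so its temporaries didn't stretch register lifetimes across
// the rest of the shader.
[isolate]
{
    float3 fogParams = ComputeFog(worldPos); // hypothetical helper
    color.rgb = lerp(color.rgb, fogColor, fogParams.z);
}
```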
Oh man, I would love that.
However, the GCN microcode is very nice and clean, so I could see myself hand-writing some performance-critical shaders if the compiler doesn't cooperate. In the old DX9 (SM 2.0) era, I wrote all the shaders by hand in DX assembly. As shaders were limited to 64 instruction slots, this was necessary to keep the slot count down and fit our lighting shaders on the GPU (we were among the first developers to ship a game with deferred rendering).
I have not-so-fond memories of debugging complex memory addressing in IL by writing values to a buffer instead of the kernel's output. On PC, assembly is still subject to the whims of the driver, so the only real solution seems to be writing/patching the ELF binary directly. Are there whims to deal with on console if you write assembly?
Loops in general are a good way to force the compiler to keep variable lifetimes inside a region.
Yes, sometimes it even makes sense to have two or more loops in sequence doing exactly the same work, but on different data: e.g. three loops, one for each of red, green, and blue.
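A sketch of that per-channel split, assuming a hypothetical per-light function EvaluateLight:

```hlsl
// Sketch: one loop per channel. Each loop keeps only one fp32 accumulator
// (plus its own temporaries) live at a time, instead of all three at once.
float r = 0.0;
[loop] for (uint i = 0; i < lightCount; ++i)
    r += EvaluateLight(i).r; // hypothetical per-light evaluation

float g = 0.0;
[loop] for (uint j = 0; j < lightCount; ++j)
    g += EvaluateLight(j).g;

float b = 0.0;
[loop] for (uint k = 0; k < lightCount; ++k)
    b += EvaluateLight(k).b;

float3 result = float3(r, g, b);
```

The obvious trade is that EvaluateLight runs three times per light, so this only wins when registers, not ALU, are the bottleneck.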
Barriers can also be used to prevent the compiler from moving data loads (= GPR allocations) over them.
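In a compute shader that looks roughly like this (resource names and DoHeavyWork are placeholders, and how strictly a given compiler treats the barrier as a scheduling fence varies):

```hlsl
StructuredBuffer<float4> inputA;
StructuredBuffer<float4> inputB;
RWStructuredBuffer<float4> output;

[numthreads(64, 1, 1)]
void main(uint tid : SV_DispatchThreadID)
{
    float4 early = inputA[tid];
    float4 acc   = DoHeavyWork(early); // hypothetical register-hungry phase

    // Intended to stop the compiler from hoisting the next load above
    // this point and holding 'late' in GPRs during the first phase.
    GroupMemoryBarrierWithGroupSync();

    float4 late = inputB[tid]; // stays below the barrier
    output[tid] = acc + late;
}
```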
Neither those nor mem_fence work well for this in OpenCL (nested ifs are much better).
Other techniques I forgot to mention last night:
- reduction - if a loop's iteration count is fixed and can be a power of two, use multiple work items to each compute a distinct loop iteration, and reduce at the end (see the sketches after this list)
- read-ahead - at the cost of some GPRs (e.g. if you have 100+ and have no chance of getting down to 84), read the data at the start of the loop that will be consumed on the following iteration. Works very nicely as long as there's substantial work in the loop (also sketched below)
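A sketch of the reduction idea as an HLSL compute shader, assuming a fixed 64-iteration loop and a hypothetical ComputeIteration body:

```hlsl
// Sketch: 64 work items each run one former loop iteration, then
// tree-reduce the partial results in LDS.
groupshared float partial[64];

[numthreads(64, 1, 1)]
void Reduce(uint gi : SV_GroupIndex)
{
    partial[gi] = ComputeIteration(gi); // hypothetical former loop body
    GroupMemoryBarrierWithGroupSync();

    // Power-of-two tree reduction: 64 -> 32 -> ... -> 1.
    for (uint stride = 32; stride > 0; stride >>= 1)
    {
        if (gi < stride)
            partial[gi] += partial[gi + stride];
        GroupMemoryBarrierWithGroupSync();
    }
    // partial[0] now holds the full result.
}
```

And the read-ahead (software pipelining) idea, with a placeholder buffer and a hypothetical ShadeLight doing the substantial per-iteration work:

```hlsl
// Sketch: fetch iteration i+1's data while iteration i's math is in
// flight. Costs one extra float4 of GPRs, hides the load latency.
StructuredBuffer<float4> lights; // placeholder

float3 AccumulateLights(uint lightCount)
{
    float3 acc  = 0;
    float4 next = lights[0];

    for (uint i = 0; i < lightCount; ++i)
    {
        float4 cur = next;
        if (i + 1 < lightCount)
            next = lights[i + 1]; // issued early for the next iteration
        acc += ShadeLight(cur);   // hypothetical substantial work
    }
    return acc;
}
```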
It's a good step in the right direction. On-chip communication between different shaders cooperating with each other in a fine-grained manner would be the final goal. Hopefully someday we get a flexible system that allows us to define our own shader stages and the communication between them (in a way that allows the GPU to load balance well between the compute units).
To catch up with what could have been possible with Larrabee years ago.
I didn't like the DX11 tessellation design, since it added two additional predefined shader stages (PS+VS+GS+CS was already quite a lot). This clearly pointed to the need for user-defined stages and communication. I would have preferred something more flexible instead of lots of API bulk designed around a single feature. If predefined stages don't go away, we will soon have 10+ different stages with slightly different semantics (and a HUGE, bloated API).
I can't remember seeing an alternative tessellation API that was demonstrably as capable yet cleaner. Anyway, I'm not sure yet more pipeline stages are likely.
There could be alternative pipeline styles (e.g. "ray trace").
What we need is the ability to lock L2 cache lines (or call them last-level cache lines if you want to be generic) for on-chip pipe buffering. GPUs have loads of L2. NVIDIA is effectively doing this already as part of load-balancing tessellation.