In this case I'm thinking of packing three 10-bit integers to save four overall
FP10 is slightly too inaccurate for accumulation purposes. I would prefer to accumulate in FP16 when I output in FP11. In this case FP16 causes practically no loss, while FP10 / FP11 would both cause loss. It is also debatable whether FP11 output (storage) is enough for PBR (with realistic dynamic range and realistic specular exponents). I find it (barely) enough when used as a storage format. Accumulating multiple lights (of similar brightness) would reduce the mantissa quality by roughly one or two bits (and reducing "barely enough" by two bits is not going to please the artists).
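To illustrate the packing idea from the quote, here is a minimal HLSL sketch (function names are made up; it assumes the three values are unsigned and fit in 10 bits):

```hlsl
// Pack three 10-bit unsigned values into a single 32-bit uint,
// so they occupy one GPR instead of three.
uint Pack3x10(uint3 v)
{
    return (v.x & 0x3FF) | ((v.y & 0x3FF) << 10) | ((v.z & 0x3FF) << 20);
}

// Unpack when the values are needed again.
uint3 Unpack3x10(uint packed)
{
    return uint3(packed & 0x3FF, (packed >> 10) & 0x3FF, (packed >> 20) & 0x3FF);
}
```

The extra ALU work is usually cheap compared to the registers saved, but (as said above) packed float formats like FP10/FP11 are a different story when you need to accumulate into them.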
But anyway, if your register allocation is at 110, then saving four registers isn't going to magically get you down to 84, which is what you need to get a whole extra hardware thread on the ALU
If you are at 110 you are already dead. A more realistic scenario is optimizing down from something like ~70 registers to 48 (or 64). This provides a big performance boost (3 -> 5 concurrent waves, since the 256-entry VGPR file of a GCN SIMD is divided between its waves: floor(256/70) = 3, but floor(256/48) = 5). Obviously it requires quite a bit more work than saving four registers, but four is always a good start.
I have not-so-fond memories of debugging complex memory addressing in IL by writing stuff to a buffer instead of the kernel's output. On PC, assembly is still subject to the whims of the driver, so the only real solution seems to be writing/patching the ELF format.
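For illustration, the write-to-a-debug-buffer trick might look something like this in HLSL (the buffer name and the address math are made up; the point is just to dump intermediates for CPU readback):

```hlsl
// Debug-only side channel: dump intermediate addressing results to a separate
// UAV instead of (or in addition to) the kernel's normal output.
RWStructuredBuffer<uint2> debugOut : register(u1);

[numthreads(64, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID)
{
    // Stand-in for whatever complex address calculation is being debugged.
    uint byteAddress = ((dtid.x * 12) + (dtid.x >> 4)) & 0xFFFFFFFC;

    // Record (thread id, computed address), then read the buffer back on the CPU.
    debugOut[dtid.x] = uint2(dtid.x, byteAddress);
}
```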
DirectX 10/11 IL is horrible. It's still vector based. The compiler does silly things trying to optimize your code for vector architectures (which no longer exist). All the major PC GPU vendors (+ Imagination -> Apple) moved to scalar architectures years ago.
The only reason for writing DirectX assembly (IL) in DX8 / DX9 (SM 2.0) was the strict instruction limit. The first hlsl compilers were VERY bad, frequently overflowing the 64 instruction limit. You basically had to hand-write the assembly in order to do anything complex with SM 2.0. The strict 64 instruction limit was a silly limit, as it was an IL instruction count limit (not an actual limit on the hardware microcode ops).
Are there whims to deal with on console if you write assembly?
I can only talk about the Xbox 360 here, as Microsoft has posted most of the low level details about the architecture publicly, including the microcode syntax (thanks to the XNA project). Xbox 360 supported inline microcode (hlsl asm block), making it easy to write the most critical sections in microcode.
Isolate documentation:
http://msdn.microsoft.com/en-us/library/bb313977(v=xnagamestudio.31).aspx
Other hlsl extended attributes:
http://msdn.microsoft.com/en-us/library/bb313968(v=xnagamestudio.31).aspx
Some microcode stuff (and links to more) can be found here:
http://synesthetics.livejournal.com/3720.html
Unfortunately many of the XNA pages have been removed (most likely because XNA is discontinued), so most of the links in that presentation (and some Google search results) do not work. Google cache helps.
Other techniques I forgot to mention last night...
One of the most important things to remember is that only peak GPR usage matters. People often describe this as a problem of GPU architecture design. However, it is sometimes a good thing as well, since it means that you can freely use more GPRs in other places (assuming those new registers are not live at the peak), and you only need to optimize the peak to reduce the GPR count (not the other local peaks that are smaller than the biggest one).
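A contrived HLSL sketch of the idea (the live-value comments are illustrative; the real allocation of course comes from the compiler):

```hlsl
Texture2D    tex0;
SamplerState samp;

float4 ShadePixel(float2 uv : TEXCOORD0) : SV_Target
{
    // Section A: several wide values are live at the same time -> this is the peak.
    float4 s0 = tex0.Sample(samp, uv);
    float4 s1 = tex0.Sample(samp, uv + float2(0.001, 0.0));
    float4 s2 = tex0.Sample(samp, uv + float2(0.0, 0.001));
    float4 blurred = (s0 + s1 + s2) * (1.0 / 3.0);

    // Section B: s0/s1/s2 are dead here, so these temporaries reuse the same
    // registers. Adding more work here does not raise the allocation as long
    // as its local register usage stays below the peak of section A.
    float  luma  = dot(blurred.rgb, float3(0.299, 0.587, 0.114));
    float3 toned = blurred.rgb / (1.0 + luma);
    return float4(toned, 1.0);
}
```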
To catch up with what could have been possible with Larrabee years ago
Yes... but Larrabee was slightly too big and less energy efficient compared to the competition. Hopefully Intel returns to this concept in the future. Intel has the highest chance to pull this off, as they have quite a big process advantage.
I can't remember seeing an alternative tessellation API that would have been demonstrably as capable and cleaner. Anyway, I'm not sure if there's likely to be yet more pipeline stages.
I didn't mean that the tessellation API is messy. This API is perfect for tessellation, but it could have been more generic to suit some other purposes as well. This pipeline setup has some nice properties, such as running multiple shaders concurrently with different granularity (and passing data between them on-chip).
There are several use cases where you'd want to have different granularity for different processing and memory accesses. The GCN scalar unit is helpful for some of them (when the granularity difference is 1:64 or more), but it's not generic enough. The work suitable for the scalar unit is automatically extracted from the hlsl code by the compiler. As you said earlier, the compilers are not always perfect. I would prefer to manually state which instructions (and loads) are scalar to ensure that my code works the way I intend. Basically you need to perform "threadId / 64" before your memory request (and math) and hope for the best. It seems that loads/math based on system value semantics (and constants read from constant buffers) have a higher probability of being extracted to the scalar unit. The scalar unit is also very good for reducing register pressure (as a value needs to be stored only once per wave, not once per thread). If you have values that are constant across 64 threads, the compiler should definitely keep these values in scalar registers (as scalar -> vector moves are fast).
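A hedged HLSL sketch of that "threadId / 64" pattern (buffer names are made up, and whether the load actually lands on the scalar unit is still up to the compiler):

```hlsl
StructuredBuffer<float4>   perWaveData : register(t0);
RWStructuredBuffer<float4> output      : register(u0);

[numthreads(64, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID)
{
    // All 64 threads of a wave compute the same index, so the compiler can
    // (hopefully) issue this as a scalar load and keep the value in SGPRs:
    // one register per wave instead of one per thread.
    uint waveIndex = dtid.x / 64;
    float4 waveConstant = perWaveData[waveIndex];

    // Per-thread work that consumes the wave-wide value.
    output[dtid.x] = waveConstant * (float)(dtid.x & 63);
}
```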
What we need is to be able to lock L2 cache lines (or call them last-level cache lines if you want to make it generic) for on-chip pipe buffering. GPUs have loads of L2. NVidia is effectively doing so as part of load-balancing tessellation.
This sounds like a good idea. However, couldn't it just store the data to memory, because all memory accesses go through the L2 cache anyway? If the lines are not evicted, the GPU will in practice transfer data from L1 -> L2 -> L1 (of another CU). To ensure that the temporary memory areas are not written out to RAM after they have been read by the other CU, the GPU should mark these pages as invalid once the other CU has received all the data. On the writing side it should of course also ensure that the line is not loaded from memory first (a special case similar to the PPC-style "cache line zero" before writing). This way it would use the L2 in a flexible way, and would automatically spill to RAM when needed.
I am starting to feel that we hijacked this thread... This is getting a little bit off topic already...