Since Vista, all graphics APIs have been primarily user-mode. Prior to Vista, OpenGL was always user-mode except for "swap buffers", which was handled by the KMD. There is still a KMD component for memory management, display management, etc.
Note that those "16"s are not related to each other at all: if you had a workgroup with 16 wavefronts, you couldn't store 16 workgroups in a CU :)
GCN has 64KB of registers per SIMD, so 256KB per CU:
4 SIMDs per CU * 64 threads per SIMD * 256 registers per thread * 4 bytes per register = 256 KB per CU.
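To make the arithmetic explicit, here's a quick Python sketch using the numbers above:

```python
# GCN register-file arithmetic (numbers from the post above).
simds_per_cu = 4
threads_per_wavefront = 64   # one lane per thread across the SIMD
max_vgprs_per_thread = 256
bytes_per_register = 4

regfile_per_simd = threads_per_wavefront * max_vgprs_per_thread * bytes_per_register
regfile_per_cu = simds_per_cu * regfile_per_simd

print(regfile_per_simd // 1024)  # 64  (KB per SIMD)
print(regfile_per_cu // 1024)    # 256 (KB per CU)
```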
No problem getting over 230 GB/s. About 277 GB/s pure read b/w and 275 GB/s pure write b/w on a FE GTX 1080. Getting over 300 GB/s would be pretty incredible as that would be well over 90% memory utilization, which is tough on any GPU I've seen.
Also, you can easily check the memory clock in...
I assume you mean that GPU A is supposed to be Nvidia Maxwell or Pascal. You should note that Maxwell takes at least 4 warps per SMM to get peak ALU rate since there are 4 vector units per SMM. A single warp per SMM can only harness, at best, 1/4 of the SMM's ALU horsepower, and at worst 1/24th.
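A rough sketch of where the 1/4 and 1/24 figures come from, assuming 4 vector units per SMM and a roughly 6-cycle dependent-instruction issue latency (the latency value is my assumption for illustration):

```python
# Single-warp ALU utilization on a Maxwell SMM, back-of-envelope.
vector_units_per_smm = 4
dependent_issue_latency = 6  # approx. cycles between dependent instructions (assumption)

# Best case: the one warp keeps exactly one of the four vector units busy.
best_case = 1 / vector_units_per_smm                 # 0.25 -> 1/4
# Worst case: a fully dependent instruction stream also stalls that one unit.
worst_case = best_case / dependent_issue_latency     # ~0.0417 -> 1/24

print(best_case)            # 0.25
print(round(1 / worst_case))  # 24
```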
sebbbi is correct if referring to how a CU works. On GCN, there are 4 SIMDs per CU. Each SIMD executes a typical instruction in 4 clocks as there are 16 ALUs per SIMD so a 64-thread wavefront takes 4 clocks to process.
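The cadence described above is just wavefront width divided by SIMD width:

```python
# GCN issue cadence: a 64-thread wavefront on a 16-lane SIMD.
wavefront_size = 64
alus_per_simd = 16
simds_per_cu = 4

clocks_per_instruction = wavefront_size // alus_per_simd
print(clocks_per_instruction)  # 4 clocks per wavefront instruction

# With the 4 SIMDs round-robining, the CU as a whole still averages
# one wavefront instruction completed per clock.
print(simds_per_cu / clocks_per_instruction)  # 1.0
```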
I agree that PC tools are lacking in this regard.
You only need to worry about the bandwidth of spilling if your kernel is largely memory-bound. If you are compute-bound, then you might have enough work to hide the latency and you likely have bandwidth to spare. This is why it's crucial to...
Right, but, as I stated, you are free to spill.
This is not always the case and I will leave it at that. Also, how would the compiler report warnings about spilling? Sure, this is possible in OpenCL where there is a log for the compilation, but what about other APIs?
Spilling is not always...
A kernel is free to use as many registers as it needs, it's the compiler that has to work within the limits of the hardware. If the kernel uses more registers than are available, then spilling will occur. With a work group of 1024 threads, you will get up to 64 registers per work-item on GCN...
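The 64-register figure falls straight out of the CU's register-file size, since the whole work group must fit on one CU:

```python
# Max VGPRs per work-item for a 1024-thread work group on a GCN CU.
register_file_per_cu = 256 * 1024  # bytes (4 SIMDs * 64 KB)
work_group_size = 1024
bytes_per_register = 4

max_regs_per_item = register_file_per_cu // (work_group_size * bytes_per_register)
print(max_regs_per_item)  # 64
```

Use more than that per work-item and the compiler either reduces occupancy or spills.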
Demonstrably false. DirectCompute requires support for work group sizes of at least 1024 threads, for example.
This has been recommended on GPUs for ages. See the huge progress made with LuxRenderer for another example.
You don't know that. There is more to performance than shader optimizations.
It doesn't matter how many there are; what matters is how important they are to reviews and gamers, wouldn't you agree?
Windows API? What is that? Do you mean the DDI layer? If so, how would one correlate the DDI...
So suppose AMD does these optimizations. What then? Do you think they have absolutely no CPU overhead? Many people complain about the apparent increased CPU overhead of AMD's drivers relative to Nvidia, yet they never stop to consider why that might be true.
Regarding the closed source library...