All these co-issue schemes imply significantly higher register file read bandwidth and 2x register write bandwidth, which from what I understand is quite expensive.
It probably is somewhat expensive. I have no idea how expensive. To use the FMA hardware as a separate add and mul looks like it requires an extra read and an extra write port. I don't recall the size of the register file off-hand (dkanter's article has it somewhere, IIRC), but the cost is probably something like 0.1% [edit: looks like 2% -- 16 * 128k * 8 b/B * 4 trannies/bit -- ouch!] of the total GPU tranny budget. Your effective flop utilization almost certainly goes up by a lot more than that.
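Spelling out that back-of-envelope edit, assuming ~4 transistors per register-file bit and a roughly 3B-transistor GF100:

16 SMs * 128 KB/SM * 8 bits/byte * 4 trannies/bit ≈ 67M transistors
67M / ~3,000M total ≈ 2%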
No, the real expense, I would think, is the extra operand buffers, the extra register space you might need because you're eating through data faster, and the logic necessary to issue and handle exception cases. But if your tranny cost goes up 5%, and if utilization goes up 25%, that's worth it, no?
Lots of "ifs" there
Similar arguments apply to separate int and fp issue, except you need to double your read bandwidth, unless you're clever and don't support FMA co-issue between the int and fp units. The advantage of doing this, though, is likely more marketing than real. I'm not sure how many mixed int/fp math kernels are out there, but marketing would be able to quote much higher op/s numbers (with FMA co-issue).
I ... would be pretty surprised if the fp mul HW weren't reused in some way for integer multiplies.
If you read back, I said the same thing earlier. But the more technical articles I read, the more I'm starting to be convinced that what we have is single-cycle DP hardware, split across two sets of int and fp units. You're right, though: if I had to bet right now, I'd say int mul and fp mul share resources of some kind [ed/clarify: and not just mul]. Int32 is two cycles.... It's just that no article has said that yet, and lots of articles have said the opposite. :shrug:
With regards to implementing stream producer/consumer communication via L2 - why is that a bad thing?
L2 is shared. If you have eight consumers and eight producers running on 16 SMs, and the kernels are relatively short, L2 is going to get a soaking. This is really what the shared "L1" should be used for, no? Let the spill-over L2 handle the 20% case, not the 100% case....
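To make concrete what I mean by keeping the hand-off on-chip, here's a minimal CUDA-style sketch (the names and the dummy multiply/add work are just placeholders, not anything GF100-specific) where the producer and consumer phases of a block exchange data through shared memory, and only the final result ever touches L2/DRAM. Of course this only works when producer and consumer live on the same SM, which is exactly the limitation being discussed below.

__global__ void produce_consume(const float* in, float* out, int n)
{
    __shared__ float staging[256];                 // per-block hand-off buffer; assumes blockDim.x <= 256
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // "producer" phase: stage one item per thread in shared memory
    if (gid < n)
        staging[threadIdx.x] = in[gid] * 2.0f;     // stand-in for real producer work
    __syncthreads();                               // the hand-off stays on-chip

    // "consumer" phase: pick up a neighbour's item from shared memory
    int src = (threadIdx.x + 1) % blockDim.x;
    int src_gid = blockIdx.x * blockDim.x + src;
    if (src_gid < n)
        out[src_gid] = staging[src] + 1.0f;        // stand-in for real consumer work
}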
FWIW, in my opinion, fine-grained threading (the traditional type) is a really good idea in a message-based system, and messaging is a fine way of organizing the environmental API of a system that not only scales internally, but is going to be scaled to thousands of processors externally. But then, that's a bias that's not informed by a lot of experience with those systems.
So we are talking about up to 4 FLOPs per CUDA core?
My current assumption is that we have 2 flops and 2 unused ints, or 2 flops and 2 used ints (for DP), or 2 ints and 2 unused flops. Never 4. I am mostly musing here....
What's so inefficient about putting circular append buffers between producer/consumer kernels and branching to consumers when they have full warps?
Hmm, am I missing something? AFAIK, you can't have two kernels active on the same SM in GF100, so you can't switch from one to the other? I think Jawed was saying that branching was an inefficient workaround for the lack of per-SM parallelism. That said, I suppose you could do something like this (rough CUDA-ish sketch):
__device__ unsigned int next_ticket;             // global counter, zeroed by the host

__global__ void MegaKernel(/* args for both "kernels" */)
{
    __shared__ unsigned int ticket;
    if (threadIdx.x == 0)
        ticket = atomicAdd(&next_ticket, 1u);    // <lock>, increment shared variable and load, <unlock>
    __syncthreads();

    if (ticket % 64 >= 32) {
        // kernel2: --blah--
    } else {
        // kernel1: --blah--
    }                                            // exit
}
At a minimum, I'd prefer to have that kind of ugliness hidden from me.
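For contrast, the two-separate-kernels version of the "circular append buffer" idea from the question would look something like the sketch below (names and sizes are made up, and I'm assuming back-to-back launches, head/tail zeroed by the host, and no more items than slots, so nothing wraps or races; item order isn't preserved, which is fine for independent work items). Every hand-off goes through L2/DRAM between launches, which is exactly the soaking I was complaining about.

#define RING_SIZE 4096                           // power of two, assumed >= total item count here

struct Ring {
    float        items[RING_SIZE];
    unsigned int head;                           // next slot to append (producer)
    unsigned int tail;                           // next slot to drain  (consumer)
};

__global__ void producer(Ring* ring, const float* src, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        unsigned int slot = atomicAdd(&ring->head, 1u) % RING_SIZE;  // claim a slot
        ring->items[slot] = src[i] * 2.0f;       // stand-in for real producer work
    }
}

__global__ void consumer(Ring* ring, float* dst, int n)  // launched after producer completes
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        unsigned int slot = atomicAdd(&ring->tail, 1u) % RING_SIZE;
        dst[i] = ring->items[slot] + 1.0f;       // stand-in for real consumer work
    }
}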