So, here we are. We went from a branch coherence of 4096 pixels in 2004 to one of 16 to 48 pixels in modern architectures. Relatively speaking, that's minuscule. In absolute terms, however, it's still a lot - and as mhouston noted in this post, it significantly affects the performance of some GPGPU algorithms, including raytracing.
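To put a rough number on why even 16-48 threads per batch hurts, here's a toy model (my own simplification, not from any vendor's numbers): assume each thread of a batch takes a given branch independently with probability p, and that a split batch has to run both sides serially. The expected cost multiplier of a single if/else then works out to 2 minus the probability that the whole batch agrees:

```python
def divergence_cost(batch_size, p):
    """Expected cost multiplier of one if/else for a SIMD batch of
    `batch_size` threads, each taking the branch independently with
    probability p.  1.0 = always coherent, 2.0 = always divergent."""
    p_coherent = p ** batch_size + (1.0 - p) ** batch_size
    return 2.0 - p_coherent

# A batch of 1 never diverges; at 50/50 branches even a batch of 4
# pays nearly double, and a batch of 16 is effectively always split.
print(divergence_cost(1, 0.5))   # → 1.0
print(divergence_cost(4, 0.5))   # → 1.875
```

Of course real branches are rarely independent coin flips - spatial coherence is exactly what saves pixel shaders - but it illustrates how fast the penalty saturates as batch size grows.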
So I'm just trying to kickstart a technical discussion on the subject here, since I think we'll all agree that this forum has been focusing a tad too much on G80 and R600 over the last few weeks/months, and architectural discussions that aren't tied to a specific chip or company can be extremely interesting too.
So, first, the reasons for, and benefits of, this level of branching coherence in modern architectures. Disregarding the extra control logic, it seems to me that lower coherence would require massively multi-ported (or double/quad-clocked) register files (as I have described here), which would be more expensive per stored byte. Is that correct, and if so, what else would be necessary?
And then, what kinds of hacks could be applied to reduce the branching coherence further, possibly with other side effects too? The Intel G965 IGP can work on Vec4 batches of 4 threads instead of scalar batches of 16 threads depending on the type of shader, which is an interesting trade-off. The G80's batch size for the VS/GS is half of what it is for the PS, but I'm not entirely sure how that works (any idea, anyone? I'd assume two distinct single-ported register banks/files and careful scheduling, but it feels ugly).
In theory, if you took G80, added a bunch of control logic, worked on Vec8 instructions instead of scalar ones (lol?) and had a quad-ported register file, you might eliminate the coherence problem - but your overall performance would probably be way worse (Vec8...!) and your chip certainly bigger. How awesome. I'd buy that! Or maybe not...
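For reference, the baseline all of these schemes are competing against is plain lane masking: a divergent batch runs both sides of the branch in lockstep, and a per-lane mask picks which result each thread keeps. A sketch in Python (function names are mine, purely illustrative):

```python
def run_branch(values, cond, then_fn, else_fn):
    """Lockstep if/else over one SIMD batch: when lanes disagree,
    both sides execute serially and a per-lane mask selects the
    result each thread actually keeps."""
    mask = [cond(v) for v in values]
    then_results = [then_fn(v) for v in values]  # pass 1: taken side
    else_results = [else_fn(v) for v in values]  # pass 2: not-taken side
    return [t if m else e
            for m, t, e in zip(mask, then_results, else_results)]

# Mixed batch: lanes 0 and 2 take the branch, lane 1 doesn't,
# yet all three lanes occupy the ALUs for both passes.
print(run_branch([1, -2, 3], lambda v: v > 0,
                 lambda v: v * 2, lambda v: 0))  # → [2, 0, 6]
```

This is exactly the "pay for both sides" cost that smaller batches or batch rearrangement would try to avoid.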
So, longer-term, I can think of two solutions, but I'm sure there are plenty more (that's what this thread is for, you know!):
A) Get rid of "true" unification and have two kinds of ALUs/schedulers/etc. - one kind that is really tuned towards branching, and another that, well, isn't. This kind of paradigm is already somewhat present in NV4x/G7x, but it lacks the thread spawning and feedback mechanisms that would make it more useful for GPGPU and other things in general. This could arguably also be seen as CPU-GPU integration over a low-latency internal bus, although the control processors might be massively parallel anyway for efficiency's sake, and x86 is arguably a tad ridiculous for this.
B) If you are massively parallel and do not require things such as derivatives, you could try rearranging your batches over time, at the cost of some extra control logic. Take G80, and "queue" 4 batches before sending anything to the ALU pipelines when a branch is hit. Then, rearrange the batches' individual threads so as to increase coherence, and invert that operation again when writing to the register file. Further extensions of this scheme could be applied to pixel shaders to improve the performance of small triangles, by "batching" things up better in quads (which requires making sure the derivatives are still correct!).
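The re-packing step of option B can be sketched in a few lines (a software illustration only, ignoring the derivative constraint; the hardware version would obviously be a crossbar/permutation network, not a sort): queue several batches at the divergent branch, group their threads by branch direction, and remember the permutation so results can be scattered back to each thread's original register-file slot afterwards.

```python
def regroup(batches, cond):
    """Re-pack the threads of several queued batches by branch
    direction, so each new batch is (mostly) coherent.  Returns the
    new batches plus the permutation needed to undo the shuffle when
    writing results back to the register file."""
    batch_size = len(batches[0])
    flat = [thread for batch in batches for thread in batch]
    # Stable sort by branch direction: all not-taken threads first.
    order = sorted(range(len(flat)), key=lambda i: cond(flat[i]))
    new_batches = [[flat[i] for i in order[k:k + batch_size]]
                   for k in range(0, len(flat), batch_size)]
    return new_batches, order

# Two 2-wide batches, each split 50/50 on (v < 0), become two
# fully coherent batches after regrouping.
print(regroup([[1, -1], [2, -2]], lambda v: v < 0))
# → ([[1, 2], [-1, -2]], [0, 2, 1, 3])
```

In the worst case (a 50/50 split inside every queued batch), this turns 4 half-divergent batches into 2 fully taken and 2 fully not-taken ones, halving the ALU work for that branch.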
So obviously, there must be tons of possibilities I'm missing, and some of the things I've listed might not even work at all. But hey, if the original post included every conceivable piece of information, it'd be pretty hard to talk about it further anyway!
Uttar