Improving dynamic branching beyond R520 and G80?

Arun

So, here we are. We've gone from a branch coherence of 4096 pixels in 2004 to one of 16 to 48 pixels in modern architectures. Relative to where we started, that's minuscule. In absolute terms, however, it's still a lot - and as mhouston noted in this post, this is significantly affecting the performance of some GPGPU algorithms, including raytracing.

So I'm just trying to kickstart a technical discussion on the subject here, since I think we'll all agree that this forum has perhaps been focusing a tad too much on G80 and R600 over the last few weeks/months, and architectural discussions that aren't tied to a specific chip or company can also be extremely interesting.

So, first, the reasons for and benefits of having this level of branch coherence on modern architectures. Disregarding the extra control logic, it seems to me that lower coherence would require massively multi-ported (or double/quad-clocked) register files (as I have described here), which would be more expensive per stored byte. Is that correct, and if so, what else would be necessary?
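To make the cost model explicit, here's a tiny Python toy of predicated execution over a single batch - the batch size and instruction counts are numbers I'm making up, it's just to illustrate why a divergent batch pays for both sides of a branch:

```python
# Minimal sketch of predicated execution over one SIMD batch.
# Purely illustrative; batch size and costs are made-up numbers,
# not a description of any real GPU.

def batch_branch_cost(conditions, cost_taken, cost_not_taken):
    """Cycles one batch spends on an if/else, with per-element masks."""
    any_taken = any(conditions)
    any_not_taken = not all(conditions)
    cost = 0
    if any_taken:           # the whole batch steps through the 'if' side,
        cost += cost_taken  # inactive elements are simply masked off
    if any_not_taken:       # ...and then through the 'else' side
        cost += cost_not_taken
    return cost

# A fully coherent batch pays for one side only:
print(batch_branch_cost([True] * 16, 100, 10))             # -> 100
# A single stray pixel forces the batch to pay for both sides:
print(batch_branch_cost([True] * 15 + [False], 100, 10))   # -> 110
```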

And then, what kinds of hacks could be applied to reduce the branching coherence further, while possibly also having other side effects? The Intel G965 IGP can work on Vec4 batches of 4 threads instead of scalar batches of 16 threads depending on the type of shader, which is an interesting trade-off. The G80 has a batch size for the VS/GS that is half what it is for the PS, but I'm not entirely sure how that works (any idea, anyone? I'd assume two distinct single-ported register banks/files and properly scheduling that, but it feels ugly).

In theory, looking at G80, if you added a bunch of control logic, worked on Vec8 instructions instead of scalar ones (lol?), and had a quad-ported register file, you might eliminate the coherence problem - but your performance would probably be way worse overall (Vec8...!) and your chip would certainly be bigger. How awesome. I'd buy that! Or maybe not... ;)
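To put some very rough numbers on that batch-size trade-off, here's a toy model I'm making up (independent per-pixel branch probability, both sides executed whenever a batch diverges - real workloads are far more spatially coherent, so treat it as a pessimistic illustration):

```python
# Toy model: chance that a batch of N pixels diverges, and the expected cost
# of an if/else, assuming each pixel independently takes the branch with
# probability p. Illustration only, not real workload data.

def p_divergent(n, p):
    return 1.0 - p**n - (1.0 - p)**n

def expected_cost(n, p, cost_taken, cost_not_taken):
    both = p_divergent(n, p)
    return (p**n * cost_taken
            + (1.0 - p)**n * cost_not_taken
            + both * (cost_taken + cost_not_taken))

for n in (4, 16, 32, 48, 1024, 4096):
    print(n, round(p_divergent(n, 0.1), 3),
          round(expected_cost(n, 0.1, 100, 10), 1))
```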

So, longer-term, I can think of two solutions, but I'm sure there are plenty more (that's what this thread is for, you know!):

A) Get rid of "true" unification and have two kinds of ALUs/schedulers/etc. - one kind which is really tuned towards branching and the other which, well, isn't. This kind of paradigm is already kinda present in NV4x/G7x, but it lacks the thread spawning and feedback mechanisms, etc., that would make it more useful for GPGPU or other things in general. This could arguably also be seen as CPU-GPU integration with a low-latency internal bus, but perhaps the control processors might be massively parallel anyway for efficiency's sake, and x86 is also arguably a tad ridiculous for this.

B) If you are massively parallel and do not require things such as derivatives, you could try rearranging your batches over time, at the cost of some extra control logic. Take G80, and "queue" 4 batches before sending anything to the ALU pipelines in the case of a branch. Then rearrange all the batches' individual threads so as to increase coherency, and invert that operation when writing back to the register file. Further extensions of this scheme could be applied to pixel shaders to improve the performance of small triangles, by "batching" things up better in quads (which requires making sure the derivatives are still correct!).
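A rough software sketch of what I mean by B - the 4-batch queue depth, the sizes and all the names here are just placeholders - repacking at quad granularity so derivatives survive, and remembering the permutation so it can be inverted at register writeback:

```python
# Illustrative repacking of queued batches by branch direction, at quad
# granularity so derivatives stay intact. A software sketch, not hardware.

QUAD = 4     # threads per quad
BATCH = 16   # threads per batch (placeholder size)

def repack(batches, taken):
    """batches: lists of thread ids; taken: thread id -> branch direction.

    Returns (new_batches, mapping); mapping records each thread's original
    (batch, slot) so the shuffle can be undone when writing results back.
    A quad follows the direction of its first thread here, i.e. we only
    claim per-quad coherence, not per-pixel coherence.
    """
    quads = []
    for b_idx, b in enumerate(batches):
        for q in range(0, len(b), QUAD):
            quads.append([(b_idx, q + i, b[q + i]) for i in range(QUAD)])
    # Group quads going the same way into the same new batch.
    quads.sort(key=lambda quad: taken[quad[0][2]])
    new_batches, mapping = [], []
    for q in range(0, len(quads), BATCH // QUAD):
        group = quads[q:q + BATCH // QUAD]
        new_batches.append([tid for quad in group for (_, _, tid) in quad])
        mapping.append([(b, s) for quad in group for (b, s, _) in quad])
    return new_batches, mapping

# 4 queued batches of 16 threads; the first quad of each batch went the
# other way at a branch:
batches = [list(range(i * 16, (i + 1) * 16)) for i in range(4)]
taken = {t: (t % 16) >= 4 for t in range(64)}
new_batches, _ = repack(batches, taken)
for nb in new_batches:
    print(sorted({taken[t] for t in nb}))   # ideally one direction per batch
```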

So obviously, there must be tons of possibilities I'm missing, and some of the things I've listed might not even work at all. But hey, if the original post included every possible and imaginable bit of information, it'd be pretty hard to talk about it further anyway! :)


Uttar
 
I think that "thread packing" with a base unit of a quad is the way to go, since it avoids the derivative problem.

Jawed
 
Back in 2004, around GeForce 6800 time, I was talking with Nvidia about SIMD-like behavior, branching and so on. Here is a quote from David Kirk's answers on the matter:

If you had 2048 processors, each executing 256 threads in parallel, you might not mind so much if groups 64 or 128 of them are somewhat limited in their ability to run at full speed with full branch generality. This is probably where we're going in a few years.
 
Is he just saying there that as long as you've got enough parallelism, branch coherence doesn't matter as much? Or am I misunderstanding something here? If that's the point, it's an interesting one, but I fail to see how it makes sense for certain classes of algorithms.

I'd say on-chip control processors or batch reordering (as listed above) are the two biggest solutions I can see, and both are quite compatible with not caring much about the architecture's actual branch coherence. I'm not sure how some things (mostly GPGPU workloads, perhaps) can really scale up from here if architectural changes aren't made for branching. Although who knows, perhaps most of the "algorithms that matter" can be reformulated to be more friendly in terms of branching... I'm just very skeptical about that right now, though.


Uttar
 
The G80 has a batch size for the VS/GS that is half what it is for the PS, but I'm not entirely sure how that works (any idea, anyone? I'd assume two distinct single-ported register banks/files and properly scheduling that, but it feels ugly)
I assume it's this way because PS needs more latency hiding than VS and they didn't want to increase the thread state storage. A way around this is to have more pixels per thread.
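As a back-of-the-envelope illustration of that (all the numbers below are placeholders I'm picking for the sake of the arithmetic, not real G80 figures):

```python
# Toy latency-hiding arithmetic: with a fixed number of batch slots of
# scheduler state, doubling the batch width doubles the pixels in flight,
# so twice as much memory latency can be covered "for free".
# Every figure here is invented for illustration.

ALU_WIDTH = 8      # ALUs per SIMD array
BATCH_SLOTS = 24   # batches of state one scheduler can keep in flight

def hideable_latency(batch_size):
    # One instruction of one batch occupies the ALUs for this many cycles:
    cycles_per_instr = batch_size // ALU_WIDTH
    # While one batch waits on a texture fetch, the other slots can each
    # issue (at least) one independent instruction to keep the ALUs busy.
    return (BATCH_SLOTS - 1) * cycles_per_instr

print(hideable_latency(16))   # narrower VS/GS-style batches
print(hideable_latency(32))   # wider PS-style batches: 2x the latency hidden
                              # for the same amount of per-batch state
```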
 
It'd be interesting to know how much dynamic branching is used in vertex shaders. And whether that materially affects the performance of any existing SM3 games.

Jawed
 
It'd be interesting to know how much dynamic branching is used in vertex shaders. And whether that materially affects the performance of any existing SM3 games.
Branching in vertex shaders for SM3-level games is *mostly* used to skip work for animation. And that stuff is fairly coherent too, so a batch size of 16 vertices is unlikely to be a big problem - given unification, it's fairly obvious this will never be a bottleneck, either. There might be more exotic algorithms I'm not thinking of right now, but I don't see many used in SM3 games out there...


Uttar
 
Branching in vertex shaders for SM3-level games is *mostly* used to skip work for animation.
Has anyone managed to measure/document the performance gained here? I'm curious whether it's on the critical path, and if so where the impact is greatest - in low-end cards or high-end. Don't lower-end non-unified cards tend to have proportionally more VS capability (as measured against PS) than higher end cards?

Jawed
 
Arun (or should I stick with Uttar?)

Taking G80 as a basis for discussion - I definitely think all 8 ALUs in a SIMD array should be exposed to a single data item, so that e.g. 1 pixel could issue a vector instruction, or maybe a VLIW, to use all 8 ALUs. On top of that, I had this probably stupid idea... Say we are dealing with a batch of 16 pixels:
  • When generating batches, randomize the order of the vector elements (don't use a fixed screen-space pixel->ALU mapping). The pixel->ALU mapping would remain fixed for the lifetime of the batch.
  • Assume the per-cluster scheduler is able to hand off batches to the ALUs faster than the ALU array can theoretically consume batches that don't ever branch (say twice as fast).
  • Each ALU has a small local scheduler, and a small set of batches (I have no idea how many, say 16) it can run instructions for. The only thing the local scheduler does each cycle is to pick the oldest pixel from the 16 in the local set and, if there is one, issue it to the local ALU. When the global scheduler gives it an instruction from a batch to process, it only puts that instruction/batch into the set of locally active batches if at least one of the corresponding pixels is active.
So the main idea is to keep batches coherent when they "enter/exit" the ALU array, but let them become totally decoherent inside the ALUs. They would also remain coherent for the purposes of memory/TMU access. Randomizing the pixel->ALU mapping on a per batch basis is done in hopes that each ALU discards roughly the same number of inactive pixels (pixels in the "not taken" branch of a batch). The global scheduler can issue twice as fast as necessary for completely active batches on the assumption that on average about 1/2 the pixels per batch won't be on the current code path.
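Here's a quick toy experiment on the randomization part (sizes and the activity pattern are completely made up): compare how much work the busiest ALU ends up with under a fixed pixel->ALU mapping versus a per-batch randomized one, when the inactive pixels are spatially clustered.

```python
# Toy check of the randomized per-batch pixel->ALU mapping: with clustered
# inactive pixels and a fixed mapping, a few ALUs get all the work; a random
# per-batch mapping spreads it out. Sizes/patterns are made up.

import random
from collections import Counter

ALUS = 8
BATCH = 16        # 2 pixel slots per ALU per batch
N_BATCHES = 1000

def busiest_alu(randomize, seed=0):
    rng = random.Random(seed)
    work = Counter()
    for _ in range(N_BATCHES):
        # Half of each batch took the other branch, and the inactive half is
        # clustered (think: one side of a screen-space edge).
        active = [True] * (BATCH // 2) + [False] * (BATCH // 2)
        slots = list(range(BATCH))
        if randomize:
            rng.shuffle(slots)          # per-batch random pixel->ALU mapping
        for pixel, slot in enumerate(slots):
            if active[pixel]:
                work[slot // 2] += 1    # two consecutive slots = one ALU
    # Throughput is limited by whichever ALU has the most work queued up.
    return max(work.values())

print("fixed mapping, busiest ALU executes:", busiest_alu(randomize=False))
print("random mapping, busiest ALU executes:", busiest_alu(randomize=True))
```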

Anyway that was my thought. Besides having no clue whether this is feasible in HW, I'm not totally sure it's a good idea or would actually help all that much :D.

Edit - I forgot about derivatives. Maybe one could use the same scheme, except that the unit by which things are reordered is a quad of pixels, or a single vertex. I suppose one could issue quads to a single ALU to avoid having to spread data between ALUs (maybe this is why G80 issues 32 pixels to 8 ALUs?)...
 