Faking dynamic branching - technical discussion

nelg said:
Tridam said:
It has been said that when all pixels of a given quad don't take the same branch, then both branches are computed for these pixels. Does that mean that only the required branch is computed in the other case? My results say NO. They show another limitation. However, I'm still wondering whether it is a hardware limitation or a limitation coming from the current drivers.
The limitation that you are wondering about, are you referring to computing both branches or the other unspecified one?

The limitation I'm talking about is the size of the batch of quads.
 
Since the size of the batches is probably related to latency hiding, I would expect that it would be invariable. Perhaps in a future architecture, they'll reduce the size of the batches if latency hiding isn't necessary (i.e. no nearby texture instructions). This would probably pave the way toward unification of pixel and vertex pipelines.
 
Tridam said:
That means that if a single pixel in these 1024 quads uses a different branch then both branches will be computed for all of these pixels.
I've heard this batch theory from other sources, so I believe it's true, but how is this supposed to work? You can only know whether you have to do both branches after you've tested the if condition for all pixels in the batch. Wouldn't that add an enormous latency?
 
Xmas said:
Tridam said:
That means that if a single pixel in these 1024 quads uses a different branch then both branches will be computed for all of these pixels.
I've heard this batch theory from other sources, so I believe it's true, but how is this supposed to work? You can only know whether you have to do both branches after you've tested the if condition for all pixels in the batch. Wouldn't that add an enormous latency?

Not really. The branching cost is around 9 cycles. Maybe one pipeline pass is used to test the if condition and a decision is taken after that.
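
To make the batch theory concrete, here is a minimal sketch in Python of what a per-batch decision would cost. It is only a model of the idea discussed here, not of how the hardware actually works. The 9 cycle overhead is the figure measured above; the function name `batch_cycles` and the branch lengths of 20 and 30 cycles are made up for illustration.

```python
def batch_cycles(takes_if, if_cycles, else_cycles, branch_overhead=9):
    """Per-pixel cycle cost of one batch if the branch decision is made per batch.

    takes_if: one boolean per pixel in the batch (True = 'if' side).
    branch_overhead: the ~9 cycle branching cost quoted in the thread.
    if_cycles / else_cycles: hypothetical lengths of the two branches.
    """
    need_if = any(takes_if)        # at least one pixel wants the 'if' side
    need_else = not all(takes_if)  # at least one pixel wants the 'else' side
    cycles = branch_overhead
    if need_if:
        cycles += if_cycles
    if need_else:
        cycles += else_cycles
    return cycles


uniform = [True] * 4096
one_stray = [True] * 4095 + [False]
print(batch_cycles(uniform, 20, 30))    # 29: only the 'if' side is paid for
print(batch_cycles(one_stray, 20, 30))  # 59: one stray pixel makes the whole batch pay for both
```

In this model the condition is evaluated once for the whole batch, so the extra latency is the fixed overhead rather than something that grows with the number of pixels.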

Honestly, I did these tests some weeks ago and I still don't have a satisfactory explanation.

For example, suppose a decision is taken for every batch of 4096/8192 pixels:
I create a block of 4100/8196 pixels using one branch, followed by a similar block using the other branch.
The first 4096/8192 pixels should be computed at full speed (only one branch). However, the next pixels shouldn't, because the second batch contains 4 pixels using branch 1 and 4092/8188 pixels using branch 2.

However, it doesn't work that way. It seems that the first batch is expanded to 4100/8196 pixels. I can't explain that.
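
For reference, a minimal sketch in Python of what a fixed batch size would predict for this test. It assumes pixels are processed in simple linear order and cut into fixed-size consecutive batches, which is a simplification (real rasterization order is not known here). The function name `mixed_batches` is made up for illustration.

```python
def mixed_batches(block_size, batch_size, num_blocks):
    """For each fixed-size batch, report whether it mixes both branches.

    Pixels are laid out as consecutive blocks of block_size pixels, and
    neighbouring blocks alternate between the two branches. A batch that
    spans a block boundary therefore contains pixels from both branches.
    """
    total = block_size * num_blocks
    result = []
    for start in range(0, total, batch_size):
        last = min(start + batch_size, total) - 1
        # Different block index at the two ends means a boundary was crossed.
        result.append(start // block_size != last // block_size)
    return result


print(mixed_batches(4100, 4096, 2))  # [False, True, False]
print(mixed_batches(4096, 4096, 2))  # [False, False]
```

With 4100-pixel blocks and a fixed 4096-pixel batch, the second batch comes out mixed, which is the slowdown the test was looking for and did not find.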
 
Tridam,

Are you saying that if the number of pixels submitted for rendering is greater than 1024, then the hardware (+ driver software?) can rearrange the quads in a batch so that a batch contains only pixels taking the same branch? If on the other hand, the number of pixels submitted for rendering is less than 1024, then, there is no sorting of quads and all the quads are executed in the same batch?

Also, are you saying that the 4 quads of the 4 pipelines in an Ultra won't execute quads separately until they are completed, but instead run the same instruction for the whole batch before moving on to the next? Is that for branches specifically or for pixel shaders in general?
 
Drak said:
Tridam,

Are you saying that if the number of pixels submitted for rendering is greater than 1024, then the hardware (+ driver software?) can rearrange the quads in a batch so that a batch contains only pixels taking the same branch? If on the other hand, the number of pixels submitted for rendering is less than 1024, then, there is no sorting of quads and all the quads are executed in the same batch?
No, rearranging pixels is of course not possible. I'm saying that the batch size doesn't seem to be fixed and seems to be 1024 quads (actually I think that it is 2048 quads) or more. I can't explain that.

Drak said:
Also, are you saying that the 4 quads of the 4 pipelines in an Ultra won't execute quads separately until they are completed, but instead run the same instruction for the whole batch before moving on to the next? Is that for branches specifically or for pixel shaders in general?
For pixel shaders in general.
 
Xmas said:
Tridam said:
That means that if a single pixel in these 1024 quads uses a different branch then both branches will be computed for all of these pixels.
I've heard this batch theory from other sources, so I believe it's true, but how is this supposed to work? You can only know whether you have to do both branches after you've tested the if condition for all pixels in the batch. Wouldn't that add an enormous latency?
It would add latency so 1024 quads seems like a lot. Tridam, are all of the quads in your test from a single triangle or multiple triangles? I don't know that it should matter, but the results seem a little strange.
 
Tridam,
Next time you have a 6800 in your hands, could you test your code once more with newer drivers? It would be interesting to see if there are any improvements :)
 
3dcgi said:
Xmas said:
Tridam said:
That means that if a single pixel in these 1024 quads uses a different branch then both branches will be computed for all of these pixels.
I've heard this batch theory from other sources, so I believe it's true, but how is this supposed to work? You can only know whether you have to do both branches after you've tested the if condition for all pixels in the batch. Wouldn't that add an enormous latency?
It would add latency so 1024 quads seems like a lot. Tridam, are all of the quads in your test from a single triangle or multiple triangles? I don't know that it should matter, but the results seem a little strange.

I'm doing fullscreen fillrate tests with 2 triangles.

However, I've also done some tests with more triangles. With small triangles the results say that both branches are computed for every pixel.
 
Evildeus said:
Tridam,
Next time you have a 6800 in your hands, could you test your code once more with newer drivers? It would be interesting to see if there are any improvements :)

I have a 6800 here. I've tested my code with different driver revs and the result is the same.
 
Tridam said:
However, I've also done some tests with more triangles. With small triangles the results say that both branches are computed for every pixel.
If this remains so, it seems like a stretch to say that the NV40 can do dynamic branching.
 