G80 programmable power

They would be, if they were referring to the same diagrams! Bob's comment was on an earlier one.
[Image: b3d68.jpg]

[Image: b3d82.gif]


In this diagram, the blue instruction begins before the purple one finishes, even though they are dependent.
Where's that sigh smiley?

If the MAD and SF pipes are multi-threaded, as you are suggesting, then there's no instruction dependency between the magenta and blue. The magenta is batch 1432 and the blue is batch 1543, etc.

So, make up your mind, are these separate pipes that support multi-threading or is there instruction dependency?

Jawed
 
a) That diagram and its associated text do not mention multiple batches being executed at the same time.
b) Your definition of instruction dependency is non-standard. If a program's operation cannot begin before another one has completed, then it is dependent on that latter instruction.
c) It doesn't matter if you can fill up your pipeline anyway through multithreading, OoOE, or anything else. The operation is still "dependent".
d) The batches won't be dependent on each other, on the other hand, which is why this works.

So, yes, the diagram you just quoted gives the correct usage characteristics unless I'm not reading it properly, but you certainly never said that it managed to do so thanks to the scheduler concurrently managing a large number of batches and seeing the ALUs and SFUs as distinct execution resources. If you did mean that, then sorry, I guess you were just being vague... :)


Uttar
 
So, yes, the diagram you just quoted gives the correct usage characteristics unless I'm not reading it properly, but you certainly never said that it managed to do so thanks to the scheduler concurrently managing a large number of batches and seeing the ALUs and SFUs as distinct execution resources. If you did mean that, then sorry, I guess you were just being vague... :)
Technically the diagram is accidental, because I didn't intend it to be read as multi-threaded when I produced it. I wasn't paying attention to the dependency twixt magenta and blue when I produced it, which is why I retracted it in light of Bob's comment.

Bob picked up the instruction dependency clash. If Bob knows that MAD and SF are separately threaded, then there'd be no instruction dependency, and he wouldn't have needed to make the comment about the dependency clash.

So, that still leaves me no better off. If G80 actually does multi-thread in the way you're asserting, then I don't know why Bob even raised the dependency issue.

Jawed
 
Bob picked up the instruction dependency clash. If Bob knows that MAD and SF are separately threaded, then there'd be no instruction dependency, and he wouldn't have needed to make the comment about the dependency clash.
Sigh x Infinity.
Uttar said:
b) Your definition of instruction dependency is non-standard. If a program's operation cannot begin before another one has completed, then it is dependent on that latter instruction.
c) It doesn't matter if you can fill up your pipeline anyway through multithreading, OoOE, or anything else. The operation is still "dependent".
d) The batches won't be dependent on each other, on the other hand, which is why this works.
There IS an instruction dependency if you use the term's proper definition. But the batches are independent, as they always are on modern GPUs. All of Bob's comments are perfectly accurate - they just aren't super-precise, for obvious reasons.


Uttar
 
All of Bob's comments are perfectly accurate - they just aren't super-precise, for obvious reasons.
No they're not perfectly accurate as far as your assertion of multi-threading goes. If RCP r5.y r5.y has already been calculated due to multi-threaded scheduling, then the MUL that uses r5.y can't suffer from a dependency clash. It's simply impossible.

I can't find any reason for Bob to point out dependency if he well knows that multi-threading obviates SF->MAD dependency.

Jawed
 
No they're not perfectly accurate as far as your assertion of multi-threading goes. If RCP r5.y r5.y has already been calculated due to multi-threaded scheduling, then the MUL that uses r5.y can't suffer from a dependency clash. It's simply impossible.

I can't find any reason for Bob to point out dependency if he well knows that multi-threading obviates SF->MAD dependency.

Multi-threading doesn't obviate SF->MAD dependency within a batch. It just means the MAD units can be busy working on batch B while batch A is waiting for the SF operation to finish.

It would clear things up a lot if you clearly marked which batch each operation is from, and whether the diagram involves a single batch or multiple batches. At first glance, the diagrams appear to be for a single batch and assume there are no other batches, in which case you would end up with idle units during dependency bubbles. In a real system running a real workload those holes would be filled with work from other batches, which is what I think Bob and Uttar are saying: assuming enough batches and ALU-limitedness, at least one of the pipelines should always have useful work to do.
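To make the single-batch versus multi-batch distinction concrete, here is a toy scheduler sketch in Python. Everything in it is an assumption made up for illustration: the five-instruction program, the latencies, the non-pipelined pipes and the "lowest-numbered ready batch first" policy. It is not a model of how G80 actually schedules work; it only shows how an in-batch SF->MAD dependency can leave the MAD pipe idle with one batch, and how other batches can fill that hole.

```python
# Hypothetical toy program: (pipe, latency in cycles). Every instruction
# depends on the previous instruction of the SAME batch; batches are
# independent of each other. The numbers are invented for illustration.
PROGRAM = [("SF", 4), ("MAD", 1), ("MAD", 1), ("MAD", 1), ("MAD", 1)]


def simulate(num_batches, program=PROGRAM):
    """Return (total_cycles, mad_busy_cycles) for a toy two-pipe machine."""
    pc = [0] * num_batches          # next instruction index for each batch
    ready_at = [0] * num_batches    # cycle at which each batch's last result is available
    pipe_free_at = {"MAD": 0, "SF": 0}
    mad_busy = 0
    cycle = 0
    while any(p < len(program) for p in pc):
        for pipe in ("MAD", "SF"):
            if pipe_free_at[pipe] > cycle:
                continue  # pipe is still occupied by an earlier instruction
            # Pick the lowest-numbered batch whose next instruction targets
            # this pipe and whose previous instruction has completed.
            for b in range(num_batches):
                if (pc[b] < len(program)
                        and ready_at[b] <= cycle
                        and program[pc[b]][0] == pipe):
                    latency = program[pc[b]][1]
                    pipe_free_at[pipe] = cycle + latency
                    ready_at[b] = cycle + latency
                    pc[b] += 1
                    if pipe == "MAD":
                        mad_busy += latency
                    break
        cycle += 1
    return max(ready_at), mad_busy


def mad_utilisation(num_batches):
    """Fraction of all cycles in which the MAD pipe was doing work."""
    total, busy = simulate(num_batches)
    return busy / total


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(n, "batch(es): MAD utilisation", round(mad_utilisation(n), 3))
```

With these made-up numbers, a single batch leaves the MAD pipe busy for only half of the cycles, because it has to wait out the SF instruction. With four batches the figure rises to 80%, because the MAD pipe works on one batch while the SF pipe works on the next. The dependency inside each batch is still there in both cases; only the idle time changes.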
 
You can construct cases where Vec4 is just as efficient as scalar, but Uttar is correct: in real code scalar will be more efficient. The question is how much more efficient, weighed against any die area cost. This is something consumers won't be able to tell, because there are too many variables.

I wish these points had been better understood in 2005, during the initial discussion of ALU utilization on unified and non-unified designs, when comments like "Developers will just write code to maximize ALUs" were being made.

I suppose I might as well post how I think G71 executes this:

ooh and Xenos, too:

Jawed

Thanks Jawed... interesting discussion. The charts are helpful, even if we haven't nailed down how accurate they are yet ;) btw, the R600 chart you did at 97% utilization... are you dreaming? :LOL:

Kind of surprising in your first example that R580 is close to Xenos in utilization (46% to 49%).
 
Multi-threading doesn't obviate SF->MAD dependency within a batch. It just means the MAD units can be busy working on batch B while batch A is waiting for the SF operation to finish.
You'll get a stall if there are no other batches ready to run, but otherwise the dependency is always immaterial as far as unit throughput is concerned.

It would clear things up a lot if you clearly marked which batch each operation is from, and whether the diagram involves a single batch or multiple batches. At first glance, the diagrams appear to be for a single batch and assume there are no other batches, in which case you would end up with idle units during dependency bubbles.
The diagrams were constructed on the basis that MAD and SF pipelines are bound together in a dual-issue pairing - i.e. there is no multi-threaded scheduling of MAD and SF, it's a single batch.

In a real system running a real workload those holes would be filled with work from other batches, which is what I think Bob and Uttar are saying: assuming enough batches and ALU-limitedness, at least one of the pipelines should always have useful work to do.
In the diagrams (all variants for G80) one of the pipelines does indeed always have useful work to do. Bob and Uttar are separately saying that the MAD pipeline, being "critical", should always be running at 100%, regardless of the SF workload.

If Bob means that the holes are filled by multi-threading, then great, let him say so. In the meantime I'm certainly not paying $1000+ to find out :LOL: particularly when B3D staff have the hardware already. It puzzles me why this basic stuff is still missing from the architecture article.

Jawed
 
I wish these points had been better understood in 2005, during the initial discussion of ALU utilization on unified and non-unified designs, when comments like "Developers will just write code to maximize ALUs" were being made.
Unification is not the primary variable in ALU utilisation. I made a point about this, e.g. vec4+scalar versus vec3+scalar, earlier in the thread - but that pales in comparison with the question of co-issue/dual-issue and how the register file is constructed to support operand fetches, which gets more and more complicated as co- or dual-issue gets more sophisticated.

Thanks Jawed... interesting discussion. The charts are helpful, even if we haven't nailed down how accurate they are yet ;) btw, the R600 chart you did at 97% utilization... are you dreaming? :LOL:
Since been revised to 100% :D

Kind of surprising in your first example that R580 is close to Xenos in utilization (46% to 49%).
They're both pretty poor, aren't they? These examples are only showing how utilisation can fall off - they're not indicative of typical utilisation. But it's good to undermine thinking solely in terms of how many shader ops a GPU can do, or what are its GFLOPs.

I originally started playing with code samples like this to consider G80's scalar pipeline before the launch.

I'm intrigued to find out if MAD and SF are multi-threaded...

Jawed
 
In the meantime I'm certainly not paying $1000+ to find out :LOL: particularly when B3D staff have the hardware already. It puzzles me why this basic stuff is still missing from the architecture article.
It's not missing and I should know since I wrote it. It's maybe not as explicit as you'd like (and maybe it's not even on the page it should be :LOL: ), but it's there. And if you want us to run some code for you, perhaps you'd like to generate some binaries and make it nice and easy, so there's no further faffing around?
 
It's not missing and I should know since I wrote it. It's maybe not as explicit as you'd like (and maybe it's not even on the page it should be :LOL: ), but it's there. And if you want us to run some code for you, perhaps you'd like to generate some binaries and make it nice and easy, so there's no further faffing around?
Glad to see you have such pride in your work.

Jawed
 
Glad to see you have such pride in your work.

Jawed
I'm hilariously proud of it. That it doesn't tell you what you want is neither here nor there, since I didn't write it for you. I'm marginally upset you continue to give me so much poorly veiled grief about it, but then me being upset seems to be your goal, so you succeeded there at least :p

Back to the topic at hand, how about that code you want us to run? You might even use the hardware I gave to you in good faith to help write it......
 
Why did you assume otherwise? I can't recall the SF/Interpolator patent in detail but I don't remember anything indicating a scheduling dependency on ALU ops.
It's just another "ALU". Also, to be multi-threaded it requires another load of distinct scheduling/arbitration hardware to be dedicated to it. Finally, for a while it was supposed to be the home of the missing MUL (maybe it still is, past caring). Oh, and of course, Bob said I had a dependency clash, which can't happen if they're multi-threaded...

I'm not saying it's impossible, just that it seems costly and not my first-choice assumption.

In the meantime, I've thought of a different way to schedule G80:

[Image: b3d89.gif]


Here I have "doubled-up" the scheduling of pixels in a batch. I should quadruple-up the pixels in a batch, because that's how it actually executes. But that would make the diagram nearly twice as long, and wouldn't make any difference to the solution I've found for this code sample. If the code sample had a MAD-pipeline scalar instruction, then that might have necessitated the quadrupled-up diagram.

This solution doesn't require multi-threaded MAD and SF units, because what I've shown all comes from a single thread (batch).

So, as far as I can see, by ordering instruction-issue by program counter within a batch (not by pixel) it's possible for G80's compiler to maximise MAD utilisation - whether pixel or vertex or geometry shader - i.e. whether 32 or 16 object batches are issued. This seems like a pretty compelling solution to me. I think it's still possible for holes to appear in the MAD unit, if there are too many SF (or MI) instructions in the dependency chain.

(Grouping by PC is how I solved the scheduling in R600 - but it took a while for me to transfer the concept to G80 :oops: )
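As a rough illustration of ordering issue by program counter rather than by pixel, here is a small Python sketch. The function names and the idea of an abstract "issue slot" are invented for this example, and it makes no claim about how G80's hardware or compiler really orders work. It only shows that PC-ordered issue within one batch puts other pixel groups' instructions between two dependent instructions of the same group.

```python
def issue_by_pixel(num_instructions, num_groups):
    """Run one pixel group's whole program before starting the next group."""
    return [(g, pc) for g in range(num_groups) for pc in range(num_instructions)]


def issue_by_pc(num_instructions, num_groups):
    """Issue instruction 0 for every group, then instruction 1 for every group, etc."""
    return [(g, pc) for pc in range(num_instructions) for g in range(num_groups)]


def min_dependency_gap(order):
    """Smallest number of issue slots between consecutive instructions of one group.

    Returns None if no group issues more than one instruction.
    """
    last_slot = {}
    gaps = []
    for slot, (group, _pc) in enumerate(order):
        if group in last_slot:
            gaps.append(slot - last_slot[group])
        last_slot[group] = slot
    return min(gaps) if gaps else None


if __name__ == "__main__":
    # "Doubled-up" (2 groups) and "quadrupled-up" (4 groups) with a 3-instruction program.
    for groups in (2, 4):
        print(groups, "groups, by pixel:", min_dependency_gap(issue_by_pixel(3, groups)))
        print(groups, "groups, by PC:   ", min_dependency_gap(issue_by_pc(3, groups)))
```

In this sketch, issuing by pixel leaves dependent instructions back to back (a gap of one slot), while issuing by PC separates them by as many slots as there are pixel groups in the batch. Whether that gap is enough to cover a real SF latency is a separate question the sketch does not answer.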

Jawed
 
Oh, and of course, Bob said I had a dependency clash, which can't happen if they're multi-threaded...
Are you doing this on purpose? If your goal is to piss off everyone by missing the point on purpose, then you, sir, have successfully done so. If that wasn't the goal, then... :-|


Uttar
 
Why did you assume otherwise? I can't recall the SF/Interpolator patent in detail but I don't remember anything indicating a scheduling dependency on ALU ops.


There is no dependency between the SF and ALU ops; Bob was hinting at something else.
 
It's just another "ALU". Also, to be multi-threaded it requires another load of distinct scheduling/arbitration hardware to be dedicated to it. Finally, for a while it was supposed to be the home of the missing MUL (maybe it still is, past caring). Oh, and of course, Bob said I had a dependency clash, which can't happen if they're multi-threaded...


Hmmmm, yeah, I couldn't find where the B3D article indicated whether MAD+SF was set up as dual-issue or independently scheduled, so I'm not sure what Rys is referring to.
 