The penalty of flow control

DiGuru said:
Most transistors that go into an ALU consist of execution logic. Next to the instruction fetch/decode logic, you need a bunch of transistors to execute each opcode. And you need transistors that determine where to store what. And transistors to keep track of what each part is used for. And caches to store the state of all that when switching fragments and shaders.

Strictly speaking, I don't think most transistors in an ALU are execution logic. With heavy pipelining and synchronization demands, I would bet most transistors in an ALU are used for holding results at pipeline stage boundaries. True, those transistors are in the ALU, but they aren't computational transistors.

Still it is an interesting question.

I would also bet there is very little i-decode logic. Do shaders even have instruction CROMs to translate from the "opcode" to actual wide-word direct hardware codes? Or is everything VLIW (that would be my guess)?
 
Scarlet said:
I would also bet there is very little i-decode logic. Do shaders even have instruction CROMs to translate from the "opcode" to actual wide-word direct hardware codes? Or is everything VLIW (that would be my guess)?
I've read that the NV3x was VLIW, but nVidia felt it was too cumbersome to optimize for, and so moved to SIMD/MIMD for the NV40 and added some hardware for instruction decoding/scheduling.
 
DiGuru said:
Most transistors that go into an ALU consist of execution logic. ......

The registers that do the actual work are just a small part of it.
Really? Are you sure? A 3.0 VS requires a lot of storage.
 
Chalnoth said:
I've read that the NV3x was VLIW, but nVidia felt it was too cumbersome to optimize for, and so moved to SIMD/MIMD for the NV40 and added some hardware for instruction decoding/scheduling.
Looks interesting. I hope you still have the link about NV3X & VLIW saved in your favorites folder?
 
991060 said:
Chalnoth said:
I've read that the NV3x was VLIW, but nVidia felt it was too cumbersome to optimize for, and so moved to SIMD/MIMD for the NV40 and added some hardware for instruction decoding/scheduling.
Looks interesting. I hope you still have the link about NV3X & VLIW saved in your favorites folder?
Nope, but I had the link in the history of my browser:
http://www.extremetech.com/article2/0,1558,1567087,00.asp
 
Xmas said:
Why is VLIW opposed to SIMD/MIMD, Chalnoth?
Well, I meant to imply that VLIW was dropped. Anyway, yes, I suppose SIMD/MIMD really are completely orthogonal to VLIW.

I'm really not sure what I was thinking there. I guess it would be more correct to state that the NV3x was VLIW, but the NV4x is not.
 
Ok. First, about this:

Tridam said:
DarN said:
Yes, but does it handle it well? ;)

The answer is "No" at least with the current drivers (60.72). PS3.0 support isn't exposed until DX9.0c is installed. I've done some quick tests for our review. Branching is very costly. So costly that it looks like the drivers are buggy. 9 pipeline passes for the simplest branching I was able to think about :

Code:
ps_3_0

dcl vFace

def c0, 0, 0, 0, 0
def c1, 1, 0, 0, 0
def c2, 0, 1, 0, 0


if_ge vFace, c0.x
  mov oC0, c1
else
  mov oC0, c2
endif

The first thing I would assume is that the driver or chip determines that the result of the compare will probably be different for each pixel, and switches back to serial processing of the fragments. Next, it might take a clock to calculate the compare and a clock to pop the next fragment, or one clock to do both at the same time (most likely). Then it takes one clock for each mov, with a last clock to assemble the result back into a quad. For four fragments that would be roughly two clocks each, plus one to reassemble the quad, which would add up to the nine passes observed.

How does that sound?
 
Scarlet said:
DiGuru said:
Most transistors that go into an ALU consist of execution logic. Next to the instruction fetch/decode logic, you need a bunch of transistors to execute each opcode. And you need transistors that determine where to store what. And transistors to keep track of what each part is used for. And caches to store the state of all that when switching fragments and shaders.

Strictly speaking, I don't think most transistors in an ALU are execution logic. With heavy pipelining and synchronization demands, I would bet most transistors in an ALU are used for holding results at pipeline stage boundaries. True, those transistors are in the ALU, but they aren't computational transistors.

Still it is an interesting question.

I think you are correct; buffers and datapaths make up a very large part of a chip. But that is a good argument for quads as well: most of the state only has to be stored once for the whole quad, while the datapaths are vastly reduced.
 
DiGuru said:
I think you are correct; buffers and datapaths make up a very large part of a chip. But that is a good argument for quads as well: most of the state only has to be stored once for the whole quad, while the datapaths are vastly reduced.

Ignoring interpolants, most of those buffers and datapaths would be four times as wide for a quad, of course, so I'm not so sure 'vast' is the correct term.
 
Here is what I would like to see from those who know about this: with a fairly complex PS 3.0 shader, what would it take to do equivalent code in PS 2.0, without such things as flow control? First of all, is it possible?
 
MfA said:
DiGuru said:
I think you are correct; buffers and datapaths make up a very large part of a chip. But that is a good argument for quads as well: most of the state only has to be stored once for the whole quad, while the datapaths are vastly reduced.

Ignoring interpolants, most of those buffers and datapaths would be four times as wide for a quad, of course, so I'm not so sure 'vast' is the correct term.

It depends on the design. For example: are registers shared around, or do they all have a specific function? E.g., if a chip has 32 temp registers, are they only usable as temps and do they always have the same number, or are they drawn from a pool? And how are they stored and looped around the pipeline? Hard to know for sure, but SIMD leaves a lot of room for optimizing.

You need fewer datapaths around the quad as well, for example in the crossbar between the VS and PS. The difference would not be that big if the crossbar has individual paths for each fragment, but it would be a large difference if those connections are fixed width.
 
hstewarth said:
Here is what I would like to see from those who know about this: with a fairly complex PS 3.0 shader, what would it take to do equivalent code in PS 2.0, without such things as flow control? First of all, is it possible?

You seem desperate to find something that Nvidia can do which ATI will be unable to do, as if you personally need the reassurance that you are right. These are typical traits of the IHV fan mentality. After reading this I decided to look back over all 34 of your posts, and every one of them is biased. You were convinced from your first post here that the 6800U would be the top card, even though we have seen nothing from the R420 yet; in fact you were so convinced that you have already said you are going to buy it.

IMO this is stupid: buying something when you have no idea whether it will be trounced or not. The only people who do things like this are those with that most vicious of all internet diseases, IHV Love syndrome.

I have already noted this about Malfunction today and found his past posts almost funny; that is what made me read yours, and funnily enough I found those funny too.

You must both take the weekend off and rest up in bed, otherwise IHV Love syndrome could end up being permanent and you will lose all respect and friends. :LOL: :LOL: :LOL:
 
Room needed for actual registers is insignificant compared to temporary storage needed to overcome memory latency ... and the amount of area needed for that is almost linear with the width of data being stored.

As far as ATI is concerned there soon won't be any dedicated vertex or pixel shaders anymore ... I don't really care what is best for an architecture which doesn't go that way; it is a lousy one going forward anyway.
 
MfA said:
Room needed for actual registers is insignificant compared to temporary storage needed to overcome memory latency ... and the amount of area needed for that is almost linear with the width of data being stored.

Agreed. And as we don't know what the chips look like on that level, it is hard to say much more about it.

As far as ATI is concerned there soon won't be any dedicated vertex or pixel shaders anymore ... I don't really care what is best for an architecture which doesn't go that way; it is a lousy one going forward anyway.

Yes, a superscalar General Purpose Graphics Unit. Something in between the current design and a GPU with lots of specific functions to handle vectors. There is really not much choice, I guess, as we see that the current quad-based design isn't flexible enough and it will never be fast at doing general things like flow control.
 
DWFan,

man!!! That was exactly the type of response I was not looking for.. it provides no logical information about what I was trying to get out of the question.

This is not a biased statement.. All I am interested in finding out, technically, is whether what can be done in PS3.0 can also be done in PS2.0, and what the code would look like.

I don't program these shaders, but I do program and have looked at them. If someone with knowledge of this answered, rather than posting some fan-biased statement, there would be more to go on than just speculation.

Also, information like this could help provide compatibility between the two card manufacturers in games.

Simply put.. can a 2.0 shader be written to do what any 3.0 shader can do? If so, a game could have a collection of 2.0 shaders for each of its 3.0 shaders, and if the hardware supports 3.0, use the 3.0 version. If there are features of 3.0 that 2.0 does not support, it would be important to know, so that game developers are aware of the limitations.
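To illustrate what that conversion looks like in practice, here is a rough sketch (the shader is a made-up four-tap blur, with invented constants and sampler s0, not taken from any real game). A PS 3.0 version can loop:

Code:
ps_3_0
dcl_texcoord v0
dcl_2d s0
def  c0, 0.01, 0.0, 0.25, 0.0   // c0.x = tap spacing, c0.z = 1/4 (made-up values)
defi i0, 4, 0, 0, 0             // loop four times

mov  r0, c0.y                   // accumulator = 0
mov  r1, v0                     // current sample coordinate
rep  i0
  texld r2, r1, s0              // fetch a tap
  add   r0, r0, r2              // accumulate it
  add   r1.x, r1.x, c0.x        // step to the next tap
endrep
mul  r0, r0, c0.z               // average the four taps
mov  oC0, r0

while a PS 2.0 version has no flow control at all, so the loop has to be unrolled by hand (or by the compiler):

Code:
ps_2_0
dcl  t0
dcl_2d s0
def  c0, 0.01, 0.0, 0.25, 0.0

mov   r4, t0                    // current sample coordinate
texld r0, r4, s0                // tap 0
add   r4.x, r4.x, c0.x
texld r1, r4, s0                // tap 1
add   r4.x, r4.x, c0.x
texld r2, r4, s0                // tap 2
add   r4.x, r4.x, c0.x
texld r3, r4, s0                // tap 3
add   r0, r0, r1                // sum the taps
add   r0, r0, r2
add   r0, r0, r3
mul   r0, r0, c0.z              // average the four taps
mov   oC0, r0

That works here, but only because the trip count is fixed. If the 3.0 shader loops a number of times that depends on per-pixel data, or exits early, the 2.0 version has to execute the worst case for every pixel, and the ps_2_0 limits (64 arithmetic and 32 texture instructions) run out quickly. Likewise, an if/else can be flattened by computing both sides and selecting the result, at the cost of always paying for both.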
 
As far as interpolants go, I think it's a good idea to pass a barycentric triangle coordinate to each pixel anyway (as opposed to up to 10 vec4s or whatever it is). To hide texture/memory latency you want to use as little space as possible per pixel in flight, and actually storing that much info seems ridiculous.

As far as gradients are concerned, it shouldn't be all that difficult to take a program for a single fragment, and basically interleave 3 (or 4) copies of it so that it actually computes results for 3 fragments simultaneously. This allows you to share constants (static/computed) and gives increased opportunity for super-scalar issue.

You could still use a texture unit for 3 pixels at a time, and let the fragment compiler take care of syncing tex loads so that you essentially issue a single instruction which says "give me filtered texture values for these N pixels".
 
Hyp-X said:
What I'm worried about is: if I use HLSL (or Glslang), how do I specify whether an if should be compiled as predication/a cmp construct or as a dynamic branch?

Will compilers try to decide that for me?

Code:
if (tex2D(sampler, coords).r > 0.5)
    oColor = func1();
else
    oColor = func2();

In this case whether predication or a dynamic branch is faster might depend on the actual texture!

Maybe it will select depending on how you write it?

This could result in a branch:
Code:
if (tex2D(sampler, coords).r > 0.5)
    oColor = func1();
else
    oColor = func2();

and this could use predicates:
Code:
oColor = (tex2D(sampler, coords).r > 0.5)? func1() : func2();
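For illustration only, the two forms might compile to something roughly like this (register allocation, the inlining of func1/func2, and the exact instructions are my guesses, not anything the compiler is guaranteed to emit). The branch version:

Code:
ps_3_0
dcl_texcoord v0
dcl_2d s0
def c0, 0.5, 0, 0, 0

texld r0, v0, s0
if_gt r0.x, c0.x                // true where the red channel is above 0.5
  // ...inlined func1(), result left in r1...
  mov oC0, r1
else
  // ...inlined func2(), result left in r2...
  mov oC0, r2
endif

and the flattened cmp version, which always evaluates both sides (ignoring the behaviour exactly at 0.5):

Code:
ps_3_0
dcl_texcoord v0
dcl_2d s0
def c0, 0.5, 0, 0, 0

texld r0, v0, s0
// ...inlined func1() into r1 and func2() into r2, unconditionally...
add r0.x, r0.x, -c0.x           // r0.x = red - 0.5
cmp r3, r0.x, r1, r2            // pick r1 where r0.x >= 0, else r2
mov oC0, r3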
 
In a SIMD architecture, if using predicates, that might still require different instructions to be issued for each fragment. That will only be faster if it doesn't 'break' the quad (i.e. the chip can issue the instructions for each fragment and continue executing the whole quad afterwards, instead of dropping down to serial processing).

On the other hand, a branch based on a condition that can be determined to be the same for the whole quad (for example, based on a vertex property) will have no (or a very small) penalty if the driver or chip recognizes this. (Like the Nalu demo.)

In both cases, I think the main question is: when will the chip just calculate all results and assign the correct one, when will it be able to determine that the whole quad takes the same path, and when will it always drop down to sequential processing?

If the developers know that, they can write shaders that behave consistently. That makes flow control useful.
 
DiGuru said:
In a SIMD architecture, if using predicates, that might still require different instructions to be issued for each fragment. That will only be faster if it doesn't 'break' the quad (i.e. the chip can issue the instructions for each fragment and continue executing the whole quad afterwards, instead of dropping down to serial processing).

No, predicates do not require different instructions to be issued for each fragment. With predicates, instructions are still executed; the results simply aren't written to registers. It's a write-disabling mechanism, that's all.
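A rough ps_3_0-style sketch of that mechanism (register usage made up, not a complete shader): the condition sets a predicate register, both movs are still issued, and the predicate only decides which of them is allowed to write its result.

Code:
ps_3_0
// assume r0.x already holds the condition value, r1/r2 the two candidate results
def c0, 0.5, 0, 0, 0

setp_gt p0, r0.x, c0.x          // p0 = (r0.x > 0.5)
(p0)  mov r3, r1                // write kept only where the predicate is true
(!p0) mov r3, r2                // write kept only where it is false
mov oC0, r3

Nothing in the instruction stream changes per fragment; only the register writes are masked.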
 