Another demo

The R300 does it in three cycles as it's expanded into a LG2/MUL/EX2 sequence. I would assume the GFFX does the same.
 
Humus said:
The R300 does it in three cycles as it's expanded into a LG2/MUL/EX2 sequence. I would assume the GFFX does the same.
It's a scalar instruction, though, so on the R300 it can be combined with a vec3 to form a vec3/scalar pair that can be executed in a single cycle, right? Thus, it effectively takes one less cycle to complete.
 
Humus said:
So 11 vs. 11 instructions there.

The card apparently has a 300 MHz core and executes the shaders at a rate of 6 cycles/pixel.
Adding extra instructions shows the expected 0.5 cycles/pixel increment per instruction.
 
Ostsol said:
It's a scalar instruction, though, so on the R300 it can be combined with a vec3 to form a vec3/scalar pair that can be executed in a single cycle, right? Thus, it effectively takes one less cycle to complete.

Do you see such a combination opportunity in Humus's shader?
I do not.
 
Ostsol said:
Humus said:
The R300 does it in three cycles as it's expanded into a LG2/MUL/EX2 sequence. I would assume the GFFX does the same.
It's a scalar instruction, though, so on the R300 it can be combined with a vec3 to form a vec3/scalar pair that can be executed in a single cycle, right? Thus, it effectively takes one less cycle to complete.

It can only co-issue instructions under rather limited circumstances: one instruction must work exclusively with rgb and the other exclusively with alpha. That's not the case with the LG2/MUL/EX2 sequence, though it might be possible to squeeze some other vec3 instructions in parallel with those scalar operations if I had used alpha for them.
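To make that constraint concrete, here is a hypothetical co-issue pair in ARB-style assembly (made up for illustration, not taken from the actual shader): the first instruction writes only the rgb channels and the second writes only alpha, so the two could retire in the same cycle.

```
DP3 r0.rgb, t0, t1;  # vec3 op, writes .rgb only
RCP r0.a, t2.a;      # scalar op, writes .a only -> can pair with the DP3
```

The LG2/MUL/EX2 chain doesn't fit this pattern because each instruction reads the previous one's result, regardless of which channel it lives in.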
 
According to sireric, the scalar unit of the R3xx has a general-purpose fmad, which can be used to execute the mul; in addition, the fmad can operate in parallel with the exp2/log2 functionality of the same unit. So, essentially, pow should execute in 2 cycles, if the conditions are right.
sireric said:
There are 4 FMAD units, three reserved for vector units, 1 for scalar units. However, the scalar unit can kick in to give you 4 vector ops (dot4). Now, beyond the fmad, the scalar unit has a bunch of other units, including all the exotic functions (inv,log,exp, etc...), which can operate in parallel with the MAD. Those don't share the FMAD since it could not meet our timing requirements mixed with lut's. So, a simplified MAD was merged in to perform table lookups.
 
I don't see how the LG2/MUL/EX2 sequence could be executed in two cycles. Each instruction depends directly on the previous one, so there's no parallelism there as far as I can see. Even if there's a free fmad unit, it can't do anything before the result from the LG2 is available. Effectively it may execute faster, though, since it may leave the option of executing some other instructions in parallel with it for free. I'm afraid that in this case the conditions aren't met: pretty much throughout the whole shader there's direct instruction-to-instruction dependency.
 
Actually, it depends on the pipelining, doesn't it?

Pardon my making up words as I go along :-?:

In a 3-macrostage pipeline, if one component per pseudoclock could be done, scalar dependencies could be resolved if the scalar operator was able to do one macrostage per pseudoclock for its previous instruction.


For instance: if a subunit could output one component of a mul in one clock cycle, and a pipeline had 3 such subunits replicated for one macrostage of the "vec3" part of the pipeline, it could have each subunit cascade for dependency and process 3 different pixels simultaneously but take 3 clocks for each pixel (staggering output).

However, if the scheduler could analyze dependency and manage another, more flexible, pipeline (like the scalar one in the R300), it could have the choice of using one macrostage in that pipeline for the 4th component, or getting a head start on stage propagation delay for a dependent scalar operation (if it was told to do the dependent component calculation first).
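The staggered-subunit idea above can be reduced to a toy latency/throughput model (purely illustrative, with made-up numbers matching the made-up terminology): pixel p enters one clock after pixel p-1, each result takes 3 clocks, yet one result completes per clock in steady state.

```python
# Toy model of the staggered scheme: pixel p enters at clock p, its vec3
# result has a 3-clock latency, but with three pixels in flight at once
# one pixel still completes every clock.
def completion_clocks(num_pixels, latency=3):
    return [p + latency for p in range(num_pixels)]

clocks = completion_clocks(5)
print(clocks)  # -> [3, 4, 5, 6, 7]: 3 clocks of latency, 1 completion/clock
```

The point is just that per-pixel latency (3 clocks) and average rate (1 pixel/clock) are different numbers once several pixels are in flight.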

...

What this doesn't analyze is the design cost for pipelining in this way and being able to schedule for it, have register replication for it, etc, but I thought it interesting to mention regarding considerations for future design and analysis of current hardware in the context of this statement about optimization opportunities.

Hopefully, I wasn't too sloppy in my wording and didn't make any silly errors or oversights.

I have a feeling this discussion reflects some things mentioned in prior pipeline discussions (and may have been pointed out to be fallacious there), but I can't begin to guess right now which word I'd use to search for it efficiently, though it should have been in the latter half of last year some time.

...

macrostage: What I mean is that the actual number of discrete pipeline stages might differ...the "macrostage" is a convenience of representation for this particular case to maintain per clock output.

pseudoclock: An operation in a pipeline can take more than one clock to execute, but still output one per clock due to the number of simultaneous operations conducted in the pipeline. The "pseudoclock" is just a term for being able to implement that per clock output concept while in the middle of the "simple" pipeline concept.

subunit: "Unit" usually refers to something providing useful outputs, and can mean drastically different things. I'll use "subunit" just to try to avoid confusion with statements like "the R300 has 8 flexible 4-component 24-bit-per-component floating point processing units" and other variations that might be valid depending on how you look at things.
 
When you guys were learning to write code, did you ever consistently have it come out of the compiler looking all screwed up? Was it easy for you guys? The reason I ask is that you all make it look so easy.
 
Demalion to the rescue!! (a good thing, because at the moment, I'm not up to rationalizing) ;)

Edit: :idea: Basically, what matters is the average rate of execution. Because of pipeline latencies, it is almost impossible for any architecture to have an instantaneous rate of execution (for any one instruction) equal to its average rate of execution (e.g. outputting 8 pixels every 2 clock cycles rather than 4 pixels every single clock cycle).

Not sure if I flew off on a tangent, but the same reasoning seems to apply to programs with data dependencies running on hardware which can exploit parallelism (man, that was long). The hardware can mask those latencies and dependencies by operating on other work simultaneously.
 
Luminescent said:
Demalion to the rescue!! (a good thing, because at the moment, I'm not up to rationalizing) ;)

Well, I'm pretty fuzzy-headed at the moment...had a late night cooking one of my favorite dishes from childhood for the first time (came out great!!! :D) and just woke up. So, let's see how it looks after some scrutiny first. :p
 
Humus, and anyone else, this is what we're (demalion, if I understood correctly, and myself) trying to say:
because superscalar parallelism is not available (instructions are not independent), does not mean that parallelism, altogether, is not available. The parallelism offered through pipelining is very valid.

I'll use a direct example from Humus' experience:
If we have 2 subunits (in this case an fmad alongside an exp2/log2 unit) within a scalar unit, and the instructions are interdependent:
-For the first pass, parallelism will most likely not be present (the fmad is depending on the exp2/log2 unit for a result)
-For the second pass, the fmad can be working on the first pass's mul instruction while the exp2/log2 unit is computing the parameters for the second pass, in parallel with the fmad, and the process continues.
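As a sketch of how that overlap could play out across several independent pixels, here is a toy greedy scheduler (illustrative only, not a claim about the real R300 issue logic): one hypothetical "logexp" block and one "fmad" block, each accepting a single op per clock, with the true LG2 -> MUL -> EX2 dependency chain per pixel.

```python
# Toy greedy scheduler: each pixel runs LG2 -> MUL -> EX2 with true
# dependencies; the hypothetical "logexp" and "fmad" blocks each accept
# one op per clock, so work from different pixels can overlap.
def schedule(num_pixels):
    ops = []
    for p in range(num_pixels):
        ops += [(p, "LG2", "logexp"), (p, "MUL", "fmad"), (p, "EX2", "logexp")]
    deps = {"MUL": "LG2", "EX2": "MUL"}
    done = {}  # (pixel, op) -> clock at which the result becomes available
    pending = list(ops)
    clock = 0
    while pending:
        busy = set()
        for p, op, unit in pending:
            dep = deps.get(op)
            ready = dep is None or done.get((p, dep), clock + 1) <= clock
            if ready and unit not in busy:
                busy.add(unit)          # unit is occupied this clock
                done[(p, op)] = clock + 1
        pending = [o for o in pending if (o[0], o[1]) not in done]
        clock += 1
    return clock  # total clocks to drain all pixels

print(schedule(1))   # an isolated pow: 3 clocks
print(schedule(20))  # many pixels in flight: about 2 clocks per pow
```

An isolated pow still takes 3 clocks end to end, but with enough independent pixels the EX2 of one pixel hides behind the LG2/MUL of the next, and the average settles near 2 clocks per pow, since the logexp block must do two of the three ops.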

I'll get some more info on the matter, though. The issue is only a technicality, but it is what B3D is all about: discovering hardware technicalities and knowing the explicit nature of the architecture, for the betterment of programming, applications, our minds, and the industry as a whole.

P.S. Pardon me if I got carried away. Humus and a large number of B3D readers are significantly more qualified than I.
 
Admittedly, I'm more of a software guy than a hardware guy. I tend to think of hardware in simpler terms. The most complex piece of hardware I built at university was a unit that read a standard PC keyboard, decoded the input into ASCII, and sent the characters over to a computer in IP packets, with CRC and everything, where a packet sniffer output them. I never dug too deeply into implementing pipelining and all its issues and sub-issues.
Anyway, all of this is very interesting of course, and I figure it is possible, then, to get it to run in just two cycles. The real question in the end, though, is: is it actually doing that? So I first ran the app normally and got 654 fps. Then I replaced the POW instruction with a single MUL: got 768 fps. Added another MUL: got 706 fps. Added another MUL: got 654 fps. So obviously the cost is the same as three MUL instructions.
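For what it's worth, those three measurements are internally consistent with each added MUL costing the same number of cycles: frame rate is roughly inversely proportional to cycles per pixel, so the cycle increments implied by consecutive fps readings should be equal. A quick sanity check on the numbers above:

```python
# fps measured with 1, 2 and 3 MULs in place of the POW (from the post above)
fps = [768.0, 706.0, 654.0]

# fps ~ k / cycles, so each cycle delta is proportional to a difference of 1/fps
inv = [1.0 / f for f in fps]
d1 = inv[1] - inv[0]  # cost of the 2nd MUL
d2 = inv[2] - inv[1]  # cost of the 3rd MUL

print(d1 / d2)  # close to 1.0 -> each MUL adds the same cycle count
```

The ratio comes out within a couple of percent of 1.0, which supports reading the 654 fps result as POW = 3 MULs rather than measurement noise.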
 
Edit: If you saw something I had here, in the form of a paragraph, forget it, I mentally short-circuited.

Hmmm, I should be getting a reply on the issue soon, so I'll get back to you. In theory pow should execute in two cycles, but who knows; maybe the compiler is "twitchy", or it is explicitly expanded by ATI into three instructions.
 
A little OT, but how is pow actually converted to log2 / mul / exp2?
I tried playing with it, but can't seem to get the right result... :? The only example I could find of an assembler function was one that was using recursion.

How would a CPU / VPU approximate a pow function to gain speed?
 
Tokelil said:
A little OT, but how is pow actually convertet to log2 / mul / exp2 ?

POW result, base, exp;

becomes something like

LG2 result, base;
MUL result, result, exp;
EX2 result, result;

Comes from the mathematical rule:
a^b = 2^(log2(a^b)) = 2^(b * log2 a)
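As a quick numeric check of that expansion (Python's math functions standing in for the shader ops here):

```python
import math

def pow_via_exp2_log2(base, exp):
    # LG2 result, base         -> r = log2(base)
    # MUL result, result, exp  -> r = exp * log2(base)
    # EX2 result, result       -> r = 2^(exp * log2(base)) = base^exp
    r = math.log2(base)
    r = r * exp
    return 2.0 ** r

print(pow_via_exp2_log2(3.0, 4.0))  # ~81.0, i.e. 3^4
```

Note it only works for base > 0, which is why hardware pow is usually defined that way too.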
 
Unless there's some trick I'm not thinking of:

x^y = power(x,y) = exp(y*log(x))

So that's 3 instructions for R3xx. If x were a constant, you could reduce it to 2 instructions. You could possibly do a texture lookup for log(x), which might be "parallelizable" (?).
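The constant-base reduction can be sketched like this (names are mine; written with log2/exp2 to match the shader expansion rather than the natural log above): the compiler folds log2 of the constant base, leaving only the MUL and EX2 at runtime.

```python
import math

# Folded at "compile time" when the base is a constant (3.0 is just an example)
LOG2_BASE = math.log2(3.0)

def pow_const_base(exp):
    # Only two runtime ops remain:
    # MUL result, LOG2_BASE, exp
    # EX2 result, result
    return 2.0 ** (LOG2_BASE * exp)

print(pow_const_base(4.0))  # ~81.0, i.e. 3^4
```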

It's a pure dependency chain, so no real parallelism exists.
 
I guess I didn't count on the third instruction (exp2) being dependent on the second (mul) instruction. If that final instruction were not dependent on the mul, the example I cited above might have applied; the mul and the exp2 would have been able to work in parallel.

Oh well, we live and learn. :oops:
 