Doesn't sound like quads to me.
Could be they still use quads to maximize the benefit of the cluster's data cache.
What, the same kind of "not quads" as R580's 12 pipelines per cluster?
Jawed
I don't think that's an accurate description. Instructions aren't executed in groups of three.
Which were 4 "old-fashioned pipes" with 3 shaders each, if you wish. Thus a quad.
Every GPU for the last 7 years has worked on quads and this will continue. There's no other logical choice for texturing LOD and derivative instructions, especially when they're dependent on pixel shader calculations.
I have no idea about G80's low-level internals, but from what we have so far, it doesn't scream "quads!" at me.
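To make the LOD and derivative argument concrete, here is a minimal Python sketch of how a 2x2 quad lets the hardware take finite differences between neighbouring pixels, even when the texture coordinates come out of the shader itself. The function names and the simple max-length LOD formula are illustrative assumptions, not a description of any particular GPU.

```python
import math

def quad_derivatives(quad):
    """quad[y][x] holds the (u, v) texel-space coordinate of each pixel
    in a 2x2 block. Returns coarse (ddx, ddy) finite differences."""
    (u00, v00), (u10, v10) = quad[0]
    (u01, v01), _ = quad[1]
    ddx = (u10 - u00, v10 - v00)  # difference with the horizontal neighbour
    ddy = (u01 - u00, v01 - v00)  # difference with the vertical neighbour
    return ddx, ddy

def mip_lod(quad):
    """Pick a mip level from the larger of the two derivative lengths
    (a simplified, assumed LOD rule)."""
    ddx, ddy = quad_derivatives(quad)
    rho = max(math.hypot(*ddx), math.hypot(*ddy))
    return max(0.0, math.log2(rho)) if rho > 0 else 0.0

# Coordinates computed per pixel by the shader (a dependent lookup):
# 4 texels per pixel step in x, 2 texels per pixel step in y.
quad = [[(0.0, 0.0), (4.0, 0.0)],
        [(0.0, 2.0), (4.0, 2.0)]]
print(mip_lod(quad))
```

Without the three neighbours running the same instruction at the same time, there would be nothing to difference against, which is the point being made about dependent texturing.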
How many of you believe that it hasn't been unleashed on purpose? I'm raising my hand and I hope it won't cost me a second public apology LOL
I think at this point, it's fairly clear it's not currently exposed whatsoever. The question, of course, is whether future drivers will expose it...
And, it is right. So far no-one has been able to write a shader that gets the extra MUL.
But I suspect you had something more machiavellian in mind!
The only thing I haven't tried yet is the other PCI-e slot, which is nonsense for one, and second, I've already broken some parts on the back of my case to fit the card in.
OK, well that looks to me, then, as though NVidia's marketing department got a hold of the MUL.
I recently found an NVIDIA doc (from a tech day IIRC, but it wasn't published, so I'll assume it's still under NDA, theoretically speaking!) that lists the G80 as having ~345GFlops, and not ~500GFlops, btw.
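For reference, both figures fall out of simple peak-rate arithmetic. The sketch below assumes the commonly quoted 8800 GTX configuration of 128 scalar ALUs at a 1.35 GHz shader clock; those two numbers are not stated in the thread, so treat them as assumptions.

```python
# Assumed 8800 GTX figures (not from the thread itself).
SCALAR_ALUS = 128
SHADER_CLOCK_GHZ = 1.35

def peak_gflops(flops_per_alu_per_clock):
    """Theoretical peak: ALUs x clock x flops issued per ALU per clock."""
    return SCALAR_ALUS * SHADER_CLOCK_GHZ * flops_per_alu_per_clock

mad_only = peak_gflops(2)      # a MAD counts as multiply + add = 2 flops
mad_plus_mul = peak_gflops(3)  # MAD plus the extra MUL = 3 flops

print(f"MAD only:     {mad_only:.1f} GFlops")
print(f"MAD plus MUL: {mad_plus_mul:.1f} GFlops")
```

Under those assumptions the MAD-only number lands near the ~345 figure and the MAD-plus-MUL number near the ~500 one, so the gap between the two is exactly the MUL being counted or not.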
That's a good rationalisation, so in a sense we already do have the MUL, but only because it's performing texture address calculation.
The correct way to think about it, of course, is that on NV4x, the first MUL was generally used for perspective correction. On G7x, one of the MADDs was also often reserved for perspective correction. For G8x, it never is. As such, considering inefficiencies etc., it might be fairer to compare G80's ~350GFlops with ~125GFlops for the G71's PS, plus ~50GFlops for the G71's VS. Thus, assuming perfect VS-PS distribution for G71's configuration, G80 should be at least twice as fast in terms of arithmetic, and more if it was unbalanced for G71. I think that's roughly what we're seeing, TBH.
And ATI's never counted it for GFLOPs marketing. I think asynchronous texturing introduced by R300 is the root of the dedicated MUL - it's always a vec2 operation, isn't it? But actually my GPU history is much too shaky.
ATI has had a dedicated MUL for perspective correction since R300, if not much before that! The R500 has one too, and you'd expect R600 to have one.
I think that's it then, there is no dual-issuable MUL. Score +1 for NVidia marketing.
It is surprising that NVIDIA has decided to no longer let the MUL be used for generic shader programming, while they used to in the past, and it looked like a good design choice to maximize unit utilization. This is especially true given that the SF unit is used to calculate 1/pos.w! It'll be interesting to see what they decide to do in future designs, of course.
Heh, guess I got really lucky! My 8800 GTX just missed one of my hard drives by about 2mm!
Ouch. I feel ya. I had to remove a cable management thingie on mine that was directly behind the PCIe slot; had to have that extra 1/2" or no go. We asked them if they'd prepared a chassis compatibility list like both IHVs have done with PSUs for SLI/CrossFire... I still think that would be a good idea. If AMD's new flagship is of similar length, I hope they will consider it.
Completely generic dual issue is expensive because you need to worry about routing so much data from the registers. This is probably why NVidia stayed away from this.
In a scalar architecture, it takes 4 scalar input registers per vector component.
BTW, Uttar, have you tried to see if cross-product-like operations are single cycle?
cross(a,b) = madd(a.yzx, b.zxy, -mul(a.zxy, b.yzx))
This type of calculation won't need extra register inputs.
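A quick sanity check of the swizzle formula above, as a minimal Python sketch. The helper names (swizzle, mul, madd, neg) are made up for illustration and simply mimic component-wise shader operations on 3-component tuples.

```python
def swizzle(v, order):
    """Reorder components, e.g. swizzle(v, "yzx") -> (v.y, v.z, v.x)."""
    idx = {"x": 0, "y": 1, "z": 2}
    return tuple(v[idx[c]] for c in order)

def mul(a, b):
    return tuple(x * y for x, y in zip(a, b))

def madd(a, b, c):
    return tuple(x * y + z for x, y, z in zip(a, b, c))

def neg(a):
    return tuple(-x for x in a)

def cross_swizzled(a, b):
    # cross(a,b) = madd(a.yzx, b.zxy, -mul(a.zxy, b.yzx))
    return madd(swizzle(a, "yzx"), swizzle(b, "zxy"),
                neg(mul(swizzle(a, "zxy"), swizzle(b, "yzx"))))

def cross_reference(a, b):
    """Textbook component form, for comparison."""
    ax, ay, az = a
    bx, by, bz = b
    return (ay * bz - az * by, az * bx - ax * bz, ax * by - ay * bx)

a, b = (1, 2, 3), (4, 5, 6)
print(cross_swizzled(a, b), cross_reference(a, b))
```

Both the MUL and the MADD only ever read swizzles of a and b, which is why no extra register inputs are needed beyond the MUL result being fed into the MADD.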
Given that XPD is a vec3 operation, shouldn't it be either at 133% (if the MUL is usable) or 66% of MAD4?
Check this out:
http://graphics.stanford.edu/projects/gpubench/results/
Interestingly, R5xx, NV4x, G7x, and G8x can all do single-cycle XPD, but R3xx/R4xx and NV3x can't.
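One way to arrive at the 133% and 66% figures asked about earlier, as a small Python sketch. It assumes a purely scalar ALU where MAD4 costs four MAD issue slots, and where XPD decomposes into 3 MULs plus 3 MADs; whether the MULs can be co-issued alongside the MADs is the open question in the thread.

```python
from fractions import Fraction

MAD4_SLOTS = 4  # one scalar MAD per vector component
XPD_MULS = 3    # the three products that get subtracted
XPD_MADS = 3    # the three multiply-and-subtract steps

def xpd_rate_vs_mad4(mul_co_issues):
    """XPD throughput relative to MAD4 under the assumptions above."""
    slots = XPD_MADS if mul_co_issues else XPD_MADS + XPD_MULS
    return Fraction(MAD4_SLOTS, slots)

for usable in (True, False):
    rate = xpd_rate_vs_mad4(usable)
    print(f"MUL usable={usable}: {float(rate) * 100:.0f}% of MAD4 rate")
```

A measured XPD rate equal to MAD4 (100%) fits neither case, which is part of why the results look puzzling.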
At first I was thinking "Crap, well there goes my theory", but on second thought I'm not so sure.
Wouldn't that be still 5 cycles if there is a limit of 3 input registers?
I also thought about the same thing with the registers, but maybe the access structure is not quite like we think it is. More likely, maybe the execution of the XPD is rearranged cleverly. The 6 MULs needed for an XPD can be grouped into 3 pairs that only need 3 inputs each.
I'm not sure what you mean here. Why would you want to replicate channels when writing to registers?
Channel replication already requires more than one write port so that's not an issue here.
Ugh, you're right. Can't pair a mul with a mad without exceeding 3 input registers.
ARB_fragment_program defines XPD as a vec3 operation: "The w component of the result vector is undefined." So that should not take any extra cycles.
Some of the results are really puzzling, though.
I'm just saying that writing two results in a cycle, like in your step #1, shouldn't be an issue. You mentioned that 4 different inputs (read ports) is too many, and I'm just saying 2 outputs (write ports) should be fine. Don't worry about it, I was just thinking out loud I guess.
That's an interesting idea. I wonder if such an accumulator would be accessible to both the MUL and MADD unit.
What if there was a single accumulator register that was always accessible and doesn't take away from read ports? You wouldn't need the huge multiplexer to access an arbitrary register, so it would be cheap.
1: tmp = az * by, cz = ax * by
2: cx = ay * bz - tmp, tmp = ax * bz
3: cy = az * bx - tmp
4: cz = -ay * bx + cz
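The four steps above can be checked with a short Python sketch. It runs the schedule literally and records which general registers each step reads, treating tmp as the hypothetical accumulator that does not use a read port. The register names and the 3-read-port limit are taken from the discussion; the bookkeeping itself is only an illustration.

```python
def cross_four_step(a, b):
    """Run the 4-step schedule; return the result and per-step register reads."""
    ax, ay, az = a
    bx, by, bz = b
    reads = []  # general-register reads per step (accumulator tmp excluded)

    # 1: tmp = az * by, cz = ax * by
    tmp = az * by
    cz = ax * by
    reads.append({"az", "by", "ax"})

    # 2: cx = ay * bz - tmp, tmp = ax * bz  (cx uses the old tmp)
    cx = ay * bz - tmp
    tmp = ax * bz
    reads.append({"ay", "bz", "ax"})

    # 3: cy = az * bx - tmp
    cy = az * bx - tmp
    reads.append({"az", "bx"})

    # 4: cz = -ay * bx + cz
    cz = -ay * bx + cz
    reads.append({"ay", "bx", "cz"})

    return (cx, cy, cz), reads

result, reads = cross_four_step((1, 2, 3), (4, 5, 6))
print(result, [len(r) for r in reads])
```

With the accumulator excluded, no step needs more than 3 register reads, and steps 1 and 2 each write two results, which matches the point about write ports made earlier.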
DP3 and DP4 could be accelerated too. Maybe one cycle saved for every two dot products? The results are weird. These things are definitely outside the margin of error.