G80 programmable power

Eh? Isn't this a repeat of an old discussion here?

http://www.beyond3d.com/forum/showpost.php?p=895509&postcount=351

Personally I didn't realise that the MADD ALUs were decoupled and separately threaded from the SF ALUs.

But after checking..... here in the G80 Arch article :

Special function ops (sin, cos, rcp, log, pow, etc) all seem to take 4 cycles (4 1-cycle loops, we bet) to execute and retire, performed outside of what you'd reasonably call the 'main' shading ALUs for the first time in a programmable NVIDIA graphics processor.
 
But after checking..... here in the G80 Arch article :

Yep I saw that line too. But to be fair "performed outside" doesn't really paint a clear picture of independent scheduling. But given the different execution times it does seem more obvious now that the main ALU and the SF units are fed independently. I'm actually surprised this wasn't fleshed out in a bit more detail - it seems like a pretty big deal.
 
Heh, I just noticed that Jawed's approach here would always give R600 the per-flop efficiency edge due to its assumed 2-cycle MAD. How about we invoke a new standard - per channel utilization! :)
 
But after checking..... here in the G80 Arch article :
Special function ops (sin, cos, rcp, log, pow, etc) all seem to take 4 cycles (4 1-cycle loops, we bet) to execute and retire, performed outside of what you'd reasonably call the 'main' shading ALUs for the first time in a programmable NVIDIA graphics processor.
But that text is wrong. SF is pipelined to produce one result every clock. Each of the four MI/SF units in a shader cluster can produce 1 SF or 4 MI results per clock.

There is no looping in the MI/SF to produce any results.
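To make the distinction concrete, here is a minimal sketch comparing a pipelined unit with a looping one. The function name and the 4-cycle latency are assumptions for illustration only (the latency is borrowed from the article's "4 cycles" figure, not from any confirmed hardware detail):

```python
def cycles_to_retire(n_ops, latency, pipelined):
    """Cycles until the last of n_ops results retires from a single unit."""
    if n_ops == 0:
        return 0
    if pipelined:
        # One op issues per clock; the last op retires `latency` cycles
        # after its own issue slot.
        return latency + (n_ops - 1)
    # A looping unit stays busy for `latency` cycles per op.
    return latency * n_ops


LATENCY = 4   # assumed, illustrative only
N_OPS = 32    # e.g. one SF op for each pixel of a 32-pixel batch

pipelined_cycles = cycles_to_retire(N_OPS, LATENCY, pipelined=True)
looping_cycles = cycles_to_retire(N_OPS, LATENCY, pipelined=False)
print(pipelined_cycles, looping_cycles)
```

With these assumed numbers the pipelined unit approaches one result per clock as the op count grows, while the looping unit stays at one result every four clocks.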

As for "outside", all that means is that SF doesn't share an ALU component with MAD functionality; no multi-threading is implied. In prior NVIDIA GPUs SF was shared functionality on the fourth component, Alpha.

In NV4x/G7x each of the superscalar ALUs shares the SF workload. I can't remember the exact division, but for instance RCP and RSQ are on the top ALU while the rest are on the bottom ALU.

Jawed
 
But that text is wrong.
It might be 'wrong' (I have better things to argue about right now), but it's also a lot less wrong than the majority of the bullshit you write on these very forums. Not only that, but unlike many of your posts, we also marked it very clearly to show that we were unsure when we wrote that. You could argue the same quality standards do not apply to posts as to articles, but I wouldn't really consider that as much more than an easy and improper way out of the argument.

If you think the way this website should be handled is to have a single article that we update every 5 minutes, rather than new ones over time, then please be my guest and get the hell out of here (and yes, this is an exaggeration, so don't bother saying 'I don't think it should be every 5 minutes, maybe every 10 or so...'). If the place doesn't fit your tastes, nobody's forcing you to stay, and especially with that attitude and viewpoint.


Uttar
 
Heh, I just noticed that Jawed's approach here would always give R600 the per-flop efficiency edge due to its assumed 2-cycle MAD. How about we invoke a new standard - per channel utilization! :)
That's why I included pixel rate and pixels per clock. Also remember this is just messing about with "difficult" code, rather than code that easily utilises each ALU. It's not meant to be typical, but to illustrate how ALUs can lose utilisation.

Additionally it's worth noting that whenever there's an SF unit, you have a problem counting FLOPs (if you want to take FLOP counting seriously). An SF is actually more than just 1 FLOP, yet in R580/Xenos/G71 I've treated it as being 1 or 2 (because of the shared use for ADD or MAD). Their respective efficiencies would look worse if SF counted for more.

In G80 each SF is counted as 4, not 1, FLOPs. That's because (8x2 + 4x2) FLOPs per half-cluster x 16 half-clusters x 1350 MHz = 518 GFLOPs.
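As a quick sanity check of that arithmetic, here is a minimal Python sketch. The function and parameter names are hypothetical, the per-half-cluster counts are taken straight from the expression above rather than from any NVIDIA documentation, and 1350 is treated as the shader clock in MHz:

```python
def peak_gflops(mad_flops, sf_flops, half_clusters, clock_mhz):
    """Peak GFLOPs = FLOPs per clock per half-cluster x half-clusters x clock."""
    flops_per_clock = (mad_flops + sf_flops) * half_clusters
    # MHz x FLOPs per clock gives MFLOPs; divide by 1000 for GFLOPs.
    return flops_per_clock * clock_mhz / 1000.0


# (8x2 + 4x2) FLOPs per half-cluster, 16 half-clusters, 1350 MHz
g80 = peak_gflops(mad_flops=8 * 2, sf_flops=4 * 2, half_clusters=16, clock_mhz=1350)
print(g80)
```

The same helper can be reused with a different SF weighting to see how sensitive the headline number is to how SF work is counted.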

In R600, my hypothesis is that an 8-clock macro (2 ADDs and 3 MADs) is used to calculate SF, so in this case there's no argument one way or another about the relative FLOP cost of SF. I sized R600 at 512 GFLOPs.

Jawed
 
If the place doesn't fit your tastes, nobody's forcing you to stay, and especially with that attitude and viewpoint.

What's with all the staff hostility towards Jawed all of a sudden? I'm guessing it has to do with more than just the stuff posted in this thread. You guys can't be THAT sensitive to criticism, or can you? :???:
 
What's with all the staff hostility towards Jawed all of a sudden? I'm guessing it has to do with more than just the stuff posted in this thread. You guys can't be THAT sensitive to criticism, or can you? :???:
I can't speak for them (I'm not part of the site staff) but it's pretty clear to me that Jawed's criticism is not exactly the kind of constructive criticism B3D is looking for.
 
What's with all the staff hostility towards Jawed all of a sudden? I'm guessing it has to do with more than just the stuff posted in this thread. You guys can't be THAT sensitive to criticism, or can you? :???:
I think it's fairly clear we had a fair number of valued members getting pissed off at him because he spews so much inaccurate information, on a variety of subjects. And when it's not "information", it's three-page-long posts of speculation that insiders literally roll their eyes at. That's obviously not true of every single one of his posts, but that's not the point either.

While this would be acceptable by itself, combined with the fact he has already crapped on the G80 Architecture article as having inaccuracies in, iirc, at least 2 other threads and a variety of other posts, I think this is getting absolutely ridiculous. I'd like not to have to be so harsh about this, but at this point, I doubt the message would pass otherwise.

I don't have anything personal against Jawed, and I don't really have anything against big speculative posts either, even when in the end, it turns out they were completely wrong. I had my fair share of those back in the day, heh - and some people definitely appreciate them when they aren't the vast majority of your posts, or of a thread's posts. Anyway, when an overall situation degrades below basic quality standards over extended periods of time, something is wrong. And when that is combined with him shitting on our work and implying he could do a much better job (that might not be how he means it, but it is certainly how I'm interpreting it, and various others seem to agree), I think it's about time it was made clear that this has to change...


Uttar
P.S.: This represents my viewpoint, although I believe it is shared by other admins - I'm not officially speaking for them here, though, and opinions and magnitudes may vary.
 
In my excitement at having solved the G80 MAD utilisation problem by sequencing instructions across the batch, I didn't notice that it actually requires scheduling across all 32 pixels in the batch to work, not the 16 I indicated :oops: The diagram's wrong, but the throughput is correct.

Bob, the pixel rates fell because I changed the final RSQ for a DP3.

Uttar, I'm sure PeterAce appreciates being told there's an error in the article - whether that article resides here or on another site. You and I have already debated the error and you agreed that it is indeed so. And that was before the article was updated.

Jawed
 
Well, we don't all! Some of us remember your defense of Kirk's comments re NV30's 128-bit bus! :cool:
Interestingly, if I look at NVIDIA's roadmap, we might just have new arguments for that discussion soon! With R600's apparent 512-bit bus and NVIDIA's comparatively more conservative approach in the mid-end (128-bit + DDR2/GDDR3), it'll be very interesting to see how both companies' memory bandwidth efficiency will compare. Don't you just love how OT you suddenly made this thread, geo? ;) (not that it wasn't OT enough already!)


Uttar
 