Haha, it's not. That was my point. Why doesn't AMD dump some dollars into marketing and get their stuff up and running then?
I dunno, someone needs to ask them.
Sigh, 3-ALU issue? That's a bit dishonest. You know full well that instructions from different threads can be issued to the ALUs independently, and that the hardware takes care of it, so the developer doesn't have to concern himself with that. On the other hand, all of AMD's 5 ALUs must be filled with instructions from a single thread, so the developer has to ensure there's enough ILP available.
As I described here:
http://forum.beyond3d.com/showpost.php?p=1282350&postcount=12
unrolling increased ALU utilisation. Additionally, as we've discussed before, NVidia's compiler has to decide whether to compile a MAD as a MAD or split it into MUL + ADD - there are heuristics/models for this. The aim is to maximise the utilisation of the MUL in the MI ALU.
Theoretically, the double-precision MAD unit is also usable for single-precision MAD, ADD and MUL - I don't know if NVidia's compiler tries to optimise SP code across that ALU.
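For anyone following along, this is the kind of unrolling I mean - just a minimal sketch, with a made-up kernel and names, not the code from the linked post. Each thread keeps several independent accumulators, so there are always independent MADs in flight instead of one long dependent chain:

// Illustrative kernel only. Four independent accumulators per thread
// give the compiler/scheduler four independent MAD chains to fill the
// ALUs with, rather than a single serial dependence chain.
__global__ void scaled_sum_unrolled(const float *in, float *out,
                                    float scale, int n)
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;

    // Tail elements are ignored here for brevity.
    for (int i = tid; i + 3 * stride < n; i += 4 * stride) {
        acc0 = acc0 + in[i + 0 * stride] * scale;   // each line is a MAD
        acc1 = acc1 + in[i + 1 * stride] * scale;
        acc2 = acc2 + in[i + 2 * stride] * scale;
        acc3 = acc3 + in[i + 3 * stride] * scale;
    }

    // 'out' is assumed to have one slot per thread launched.
    out[tid] = acc0 + acc1 + acc2 + acc3;
}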
And have you compared them to the corresponding algorithm on AMD's stuff?
That's what I've been doing. Some of my posts take 10+ hours to put together because I've buried my head in other forums, papers, coding, optimisation etc.
Perhaps, but again this is all guesswork. We simply do not have the evidence on AMD's side to support your hypothesis.
It's not guesswork. The most important metric in throughput computing is arithmetic intensity: the ratio of computation to bytes read/written. Optimisation performed by CUDA developers usually targets exactly that, amortising the bytes fetched/written and the loop housekeeping over more intense computation.
If you're really clever you maximise bandwidth usage and ALU utilisation simultaneously.
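To make the FLOPs/byte point concrete: the standard trick is to stage a tile in shared memory so that every byte fetched from DRAM feeds many MADs. A rough sketch of the familiar tiled matrix-multiply pattern - the tile size and names are mine, and it assumes the matrix dimension is a multiple of the tile size:

#define TILE 16

// Each element loaded from global memory into shared memory is reused
// TILE times, so arithmetic intensity grows roughly with the tile size
// instead of being stuck at ~1 MAD per pair of loads.
__global__ void sgemm_tile_sketch(const float *A, const float *B,
                                  float *C, int n)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int k0 = 0; k0 < n; k0 += TILE) {
        // One global load per thread per tile...
        As[threadIdx.y][threadIdx.x] = A[row * n + (k0 + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(k0 + threadIdx.y) * n + col];
        __syncthreads();

        // ...amortised over TILE MADs per thread.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    C[row * n + col] = acc;
}

That's also why SGEMM keeps coming up: the FLOPs/byte is tunable with the tile size, and the right tile size is exactly the sort of thing that differs between architectures.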
Shouldn't it be much easier to simply find such algorithms running on AMD's hardware as proof of concept? Where are they?
I've written code for you, in the other thread, to demonstrate this. And quoted SGEMM performance. What do you want, a thesis?
Huh, I didn't say that. I meant we should be comparing it to the hypothetical GT300. Not to G80 like you've been doing.
It's up to NVidia to deliver a genuine architectural improvement. When a reasonable rumour for such a change appears I'll bear it in mind. Right now NVidia's double-precision is out by an order of magnitude. I won't say it's impossible...
Exactly!! So we're pissing into the wind with claims that AMD's hardware would be just as good or better. There's nothing to back it up, so however much you hate Nvidia's marketing, at least they back theirs up with results. On the other hand, the stuff you're cheerleading hasn't produced anything worthwhile to date.
NVidia hasn't backed up anything, since no comparison has been made. What NVidia has done is deliver an adequate toolset to go along with its hardware. All of NVidia's comparisons are solely with CPUs - often with single CPU cores running unoptimised code.
Brook+, specifically, is still in alpha as far as I can tell, with documentation aimed at geeks who've warped in from the 1970s and don't know better. I've got no idea whether AMD's OpenCL will be useful this year. Nor whether AMD will put together a decent environment and documentation for coding.
Haha, man, I don't have anything against AMD. I just recognize and appreciate the good points of Nvidia's decisions for GPGPU, a perspective that you obviously don't share. What I don't get is that, in spite of all of CUDA's success, you work hard to point out its apparent flaws and bat hard for an alternative that hasn't proven itself - and then blame Nvidia's marketing for that.
Look in the mirror. You don't spend any time questioning what's transpiring in the field, but want to be spoon-fed. It's pretty tedious.
OpenCL will actually be an interesting battleground, because if, as you say, a well-designed algorithm should have lots of ILP, then anything optimized for Nvidia hardware should fly on AMD's stuff too. Unless by well-designed you mean explicit vec4 packing.
No, code optimised for "target x" does not get levelled out by OpenCL, per se. That was the point of my earlier remarks about SGEMM.
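Just so we're talking about the same thing, by "explicit vec4 packing" I take it you mean something like this - illustrative only, written CUDA-style since that's what's to hand; an OpenCL float4 version would look much the same, just with built-in float4 arithmetic:

// The same saxpy-style operation written per-scalar and per-float4.
// CUDA's float4 has no built-in arithmetic operators, so the components
// are written out explicitly.

__global__ void saxpy_scalar(const float *x, float *y, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];           // one MAD per thread per iteration
}

__global__ void saxpy_packed(const float4 *x, float4 *y, float a, int n4)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {                         // n4 = n / 4, assumed exact here
        float4 xv = x[i];
        float4 yv = y[i];
        yv.x = a * xv.x + yv.x;           // four independent MADs per thread:
        yv.y = a * xv.y + yv.y;           // exactly the ILP a VLIW compiler
        yv.z = a * xv.z + yv.z;           // wants handed to it on a plate
        yv.w = a * xv.w + yv.w;
        y[i] = yv;
    }
}

The packed version gives the compiler four independent operations per thread for free; the scalar version leaves it to the compiler and hardware to find the parallelism elsewhere.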
As for what kind of arithmetic intensity is optimal for any algorithm, well, one of the interesting techniques NVidia's brought to the table is "self-tuning" code (though this seems to be an old supercomputing technique, to be fair). There are different approaches to searching the optimisation space, so it becomes a question of whether the right variables for multi-target OpenCL tuning are in place.
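By self-tuning I mean something along these lines - a host-side sketch in which the kernel and the candidate block sizes are placeholders, and a real tuner would warm up and average several runs rather than trust a single timing:

#include <cuda_runtime.h>
#include <cstdio>

// Stand-in kernel; the real one is whatever you're tuning.
__global__ void my_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

// Brute-force self-tuning: time a handful of launch configurations on the
// actual card and keep the fastest, instead of hard-coding the launch
// parameters for one architecture.
int pick_best_block_size(float *d_data, int n)
{
    const int candidates[] = {64, 128, 192, 256, 384, 512};
    int   best_block = candidates[0];
    float best_ms    = 1e30f;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int block : candidates) {
        int grid = (n + block - 1) / block;

        cudaEventRecord(start);
        my_kernel<<<grid, block>>>(d_data, n);   // configuration under test
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        if (ms < best_ms) { best_ms = ms; best_block = block; }
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    printf("best block size: %d (%.3f ms)\n", best_block, best_ms);
    return best_block;
}

The interesting question for multi-target OpenCL is whether the tuning knobs exposed (work-group size, vector width, tile size and so on) are the ones that actually matter on each architecture.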
I expect SPEC-style benchmarks will be drawn up at some point, and then we get into the wonderful world of gamed benchmarking, performance/watt and all the other stuff the guys with big computers have been doing. Ah yes, graphics cards playing benchmarking games - they'll feel right at home.
Jawed