You're doing it again
If CUDA's success is all marketing then why can't AMD even come up with something that makes their stuff shine? Don't you see the fallacy in your argument?
No, you'll have to spell it out. Since when is marketing a measure of technical capability?
Right, because writing code for AMD's hardware would be just as easy.
Perhaps you'd like to explain what it is about NVidia's 3-ALU issue that makes it easier to code for than AMD's 5-ALU issue?
Exactly. I've seen several papers from folks trying to delve into this stuff using CUDA, and none of them ever concern themselves directly with ALU utilization.
I suggest you go re-read them. Any time you see someone evaluating FLOPs they're doing precisely that.
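To make that concrete, a rough worked example (figures from memory, so treat the exact clocks as approximate): a GTX 280 has 240 scalar ALUs at roughly 1.3 GHz, so MAD-only peak is 240 * 1.3 GHz * 2 flops = ~624 GFLOPS (~933 GFLOPS if you also count the co-issued MUL). If a kernel sustains ~312 GFLOPS of MADs, it's keeping the MAD ALUs about 50% busy. Quoting achieved FLOPs against peak is just ALU utilisation by another name.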
When I started reading this stuff I was expecting to find that NVidia's ALU design was a big win, freeing developers from iterations of algorithmic evolution (vectorisation, vector fetches, unrolling) to maximise performance. It's not remotely true.
As you pointed out many times, there are considerations for memory bandwidth, but the packing of VLIW instructions is something that just doesn't come up with CUDA. Yet you keep arguing that it's trivial and is an advantage that should be completely ignored...
No, my argument is really much simpler: once you have built an efficient algorithm there's so much instruction-level parallelism that having a scalar or a VLIW ALU is often neither here nor there.
There are degrees of architecture-specific tuning you can do, like I said earlier: if you have a GPU with crap caches and small per-strand state (NVidia), then you are forced to use shared memory for SGEMM.
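To show what I mean by that, here's a minimal sketch (a made-up kernel, nothing from a real codebase): four independent accumulators in the inner loop give a scalar machine four MADs to keep in flight, and give a VLIW compiler four obvious candidates to pack into one instruction word. The source doesn't change either way.

// Hypothetical dot-product-style kernel, written only to show ILP.
// The four accumulators are independent, so the four MADs per
// iteration can be pipelined back to back on a scalar ALU or packed
// into a VLIW bundle by the compiler; the source code is the same.
__global__ void dot_ilp(const float *a, const float *b, float *out, int n)
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;

    // Each thread strides through the arrays 4 elements at a time
    // (any tail elements are ignored to keep the sketch short).
    for (int i = tid * 4; i + 3 < n; i += stride * 4) {
        s0 += a[i + 0] * b[i + 0];   // independent MAD
        s1 += a[i + 1] * b[i + 1];   // independent MAD
        s2 += a[i + 2] * b[i + 2];   // independent MAD
        s3 += a[i + 3] * b[i + 3];   // independent MAD
    }

    // Per-thread partial sum; a real kernel would reduce these further.
    out[tid] = s0 + s1 + s2 + s3;
}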
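For anyone wondering what "use shared memory for SGEMM" looks like in practice, here's a bare-bones sketch of the usual tiling trick (a toy 16x16-tile kernel made up for illustration, assuming N is a multiple of the tile size, not anyone's production SGEMM):

#define TILE 16

// Toy C = A * B for square N x N row-major matrices, N assumed to be
// a multiple of TILE. Each block stages a TILE x TILE tile of A and B
// in shared memory so the repeated reuse of each element hits on-chip
// storage instead of going back to DRAM every time.
__global__ void sgemm_tiled(const float *A, const float *B, float *C, int N)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;

    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // Cooperative load of one tile of A and one tile of B.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        // Inner product over the staged tiles: a MAD per iteration.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    C[row * N + col] = acc;
}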
And remember, NVidia has 3 ALUs, not 1. Of course, another corruption of CUDA-think is that the only ALU whose utilisation matters is the MAD. Yep, marketing-101 for the win.
Games, sure, but we're not discussing those, are we?
As I said earlier, feel free to find the algorithm whose optimal code has nearly no instruction-level parallelism. Everything else is just like graphics: loaded with wodges of ILP.
The primary mitigating factor is the control flow divergence penalty of different architectures. Getting evidence for the penalties in these architectures is really hard.
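Here's the kind of thing I mean, as a minimal made-up kernel: when the branch is data-dependent, some warps take both sides and the hardware serialises the two paths with part of the warp masked off. How much that actually costs (warp versus wavefront width, re-convergence overhead and so on) is exactly what's hard to get solid numbers for.

// Hypothetical kernel: the branch depends on the data, so some warps
// will take both sides. Within such a warp the hardware runs the two
// paths one after the other with the inactive lanes masked off, so the
// cost is roughly the sum of both paths; how bad that is in practice
// depends on the divergence width and per-branch overhead.
__global__ void divergent(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (x[i] > 0.0f)
        y[i] = sqrtf(x[i]);        // "expensive" path
    else
        y[i] = 0.0f;               // cheap path
}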
And we don't know what Larrabee will do compared to upcoming hardware. Comparing it to an architecture from 2006 isn't really a worthwhile exercise, is it?
No, we should just stop talking till GT300 arrives.
CUDA has already proven itself. Where are all the GPGPU algorithms that are running better on AMD's hardware? That's where the burden of proof lies.
There's no proof they run better on NVidia's hardware. The comparison, generally speaking, hasn't been made. CUDA has been an easy choice for a lot of people because it's been shoved under their nose by NVidia, while AMD, in its own wisdom, is building relationships mostly with commercial partners who don't publish papers but do publish "speed-up" graphs with lies about speed-ups based on un-optimised CPU code.
The person I mentioned I was talking with earlier doesn't have anything else enlightening to say, so I've got nothing to add on that particular data point.
Can't blame them. It must be NVidia's marketing smoke screen that's preventing people from seeing them.
Yep, it's working on you.
Before long you'll be ribbing me for suggesting that version 1.0 of AMD's OpenCL isn't optimal; give it a while yet. I suppose that'll be all the proof you need.
Jawed