Is CUDA's success solely a function of Nvidia's dollar investment and marketing push, or is there something to the technology too?
Apart from the technical merit of the malleability of shared memory and the investment in the toolset (which was pretty ropey at the start), I'd say CUDA's success is 80% marketing. There are a lot of universities with free hardware from NVidia etc. It works for games, why not in GPGPU?
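For what it's worth, here's a minimal sketch of the kind of shared-memory flexibility I mean - a toy 1D stencil of my own invention, not anything from NVidia's material. The block stages its slice of the input (plus a one-element halo) into __shared__ storage once, then every thread reads its neighbours' values from on-chip memory in whatever pattern it likes:

// Toy example: assumes blockDim.x == 256 and n is a multiple of 256.
__global__ void stencil3(const float *in, float *out, int n)
{
    __shared__ float tile[256 + 2];               // slice plus one-element halo each side
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x + 1;

    tile[lid] = in[gid];                          // every thread stages one element
    if (threadIdx.x == 0)
        tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;         // left halo
    if (threadIdx.x == blockDim.x - 1)
        tile[257] = (gid + 1 < n) ? in[gid + 1] : 0.0f;   // right halo
    __syncthreads();

    // Each thread now reads its neighbours from on-chip storage, not DRAM.
    out[gid] = 0.25f * tile[lid - 1] + 0.5f * tile[lid] + 0.25f * tile[lid + 1];
}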
A huge amount of CUDA buzz is centred on the low-hanging fruit - e.g. the N-body problem that we've been discussing, which is embarrassingly parallel. With half or a quarter of the FLOPs NVidia would still have success in these cases.
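For reference, here's a minimal sketch (my own illustration, not code from any of the papers being discussed) of why the brute-force version is embarrassingly parallel: each thread owns one body, and its O(N) loop over the other bodies never touches another thread's results. The softening term and the use of w to hold mass are just conventions of this sketch:

// One thread per body; n-squared interactions, no inter-thread dependencies.
__global__ void nbody_forces(const float4 *pos, float3 *acc, int n, float softening)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float4 pi = pos[i];
    float3 ai = make_float3(0.0f, 0.0f, 0.0f);

    for (int j = 0; j < n; ++j) {
        float4 pj = pos[j];                       // pj.w holds the body's mass
        float dx = pj.x - pi.x;
        float dy = pj.y - pi.y;
        float dz = pj.z - pi.z;
        float distSqr = dx * dx + dy * dy + dz * dz + softening;
        float invDist = rsqrtf(distSqr);
        float s = pj.w * invDist * invDist * invDist;
        ai.x += dx * s;
        ai.y += dy * s;
        ai.z += dz * s;
    }
    acc[i] = ai;                                  // each thread writes only its own slot
}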
The configuration of the scheduler and ALUs has squat to do with NVidia's success.
Indeed, it's reasonable to argue that NVidia's absolute performance is irrelevant, just as long as it's more than the fastest quad-core CPU (which isn't difficult when you only count single-precision FLOPs). Sadly there are quite a few "scientists" out there, as well as NVidia staff, who claim speed-ups for CUDA implementations based on entirely un-optimised CPU code (the N-body stuff we've been discussing is a case in point). Some of these people are so corrupt they base their comparison on a single CPU core.
Still, there are plenty of people getting real speed-ups in things like option pricing, MRI post-processing and seismic computation - the list goes on.
Based on what metric? As far as GPGPU goes it's been a remarkable success compared to the non-existent competition.
Perhaps you'd like to explain how NVidia's style of instruction issue has provided any notable benefit solely in its own right.
Of course you can argue that the bloated size of GT200 is irrelevant if you sell them at $5000 a go.
I'm not really sure why you think it's just bloat. What alternative architecture do you propose that would have put them in a similar or better position than the one they're in today?
You mean how could NVidia have made a smaller chip with the same performance? Well, it's staring you in the face: it's called RV770. Larrabee will, I expect, show an advantage too - and blow ATI and NVidia's doors off when it comes to double-precision.
I'm baffled as to how you can draw these conclusions with nothing to compare against. Where is AMD's scheduler-light architecture excelling, exactly?
http://forum.beyond3d.com/showpost.php?p=1220350&postcount=27
http://forum.beyond3d.com/showthread.php?p=1263635#post1263635 - and the following posts:
So a purely ALU-based comparison of performance per mm2 for HD4870 against GTX280:
- float MAD serial - 57%
- float4 MAD parallel - 273%
- float SQRT serial - 221%
- float 5-inst. issue - 239%
- int MAD serial - 137%
- int4 MAD parallel - 279%
Worst case, AMD's ALUs are 76% bigger than NVidia's when running serial scalar code. Most of the time they're effectively half the size or less in terms of performance per mm2 (I sketch the difference between the serial and parallel cases below).
http://forum.beyond3d.com/showpost.php?p=1263635&postcount=494
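To make it concrete, here's a rough sketch - mine, not the benchmark code behind those numbers - of what the serial versus float4-parallel MAD cases amount to. In the serial kernel every MAD depends on the previous one, so a VLIW5 ALU can only fill one of its lanes per cycle; the float4 kernel offers four independent chains to pack side by side:

// Serial case: a single dependent chain, so no instruction-level parallelism.
__global__ void mad_serial(float *out, float a, float b, int iters)
{
    float x = a;
    for (int i = 0; i < iters; ++i)
        x = x * b + a;
    out[threadIdx.x] = x;
}

// Parallel case: four independent chains per thread, ILP of 4.
__global__ void mad_vec4(float4 *out, float a, float b, int iters)
{
    float4 x = make_float4(a, a, a, a);
    for (int i = 0; i < iters; ++i) {
        x.x = x.x * b + a;
        x.y = x.y * b + a;
        x.z = x.z * b + a;
        x.w = x.w * b + a;
    }
    out[threadIdx.x] = x;
}

(In a real microbenchmark you'd also have to stop the compiler optimising the loops away and time enough iterations to swamp launch overhead.)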
Feel free to provide a list of GPGPU applications that are fundamentally scalar with practically zero instruction-level parallelism in the optimum algorithm on NVidia. I'm still looking.
Yeah, that's unfortunate, because that point renders these discussions moot. The fact that not even AMD has been able to produce something that highlights their architecture's strengths is pretty telling to me.
AMD's SGEMM is faster than NVidia's, 300 GFLOPs on HD3870 versus 375 on GTX280.
I was referring to clause demarcation. That's done at compile time as well, no?
Yes, that's right.
Well I didn't mention VLIW. I thought we were talking about scheduling.
NVidia implemented this style of scheduling as part of its non-VLIW approach. The 3 different ALUs in GT200's cores need to be fed - AMD chose VLIW to feed the 5 different ALUs in its cores.
A benefit of their configuration is being able to implement a small warp size, 32 - though my theory is that this may really be 64 due to the effect of pairing of warps. I can't find any tests, though I'm wondering now if Volkov's mysterious "minimum of 64" that we were discussing recently is in fact confirmation - I sketch the kind of test I have in mind below. I'm strongly convinced ATI has an effective size of 128 because of paired wavefront issue in its ALUs.
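For what it's worth, here's the kind of test I mean - hypothetical and untested, names and constants are my own. Each thread runs a long chain of dependent MADs; the host launches it as a single block and sweeps the block size through 16, 32, 48, 64, 96, 128, timing each launch. Whether the timings step at 32-thread or 64-thread boundaries should say something about the effective issue granularity:

// Hypothetical probe: dependent MADs keep the ALUs busy with minimal other traffic.
__global__ void issue_width_probe(float *out, int iters)
{
    float x = (float)threadIdx.x;
    for (int i = 0; i < iters; ++i)
        x = x * 1.0000001f + 1.0f;    // each MAD depends on the previous one
    out[threadIdx.x] = x;
}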
Larrabee's an interesting comparison point because it has a scalar pipeline (not sure how many ALUs) + a vector pipeline (VPU, SIMD-16). It's unclear whether it does superscalar issue or VLIW when looking at the scalar+VPU pipes and the flow of instructions. I suspect superscalar.
Within the VPU, though, there appears to be static code scheduling, i.e. there is no dynamic operand scoreboarding - but that's a guess based on code snippets and the conceptualisation of fibres, no more. Obviously there's dynamic fetching of data from cache lines. Register fetches appear to be extremely simple, with 3 operands for one ALU (+ mask as another operand), for a set of 128 registers (4 contexts * 32 registers). It's very cheap, an 8KB register file (128 registers * 64 bytes for each 512-bit register), compared with 32KB in NVidia.
And that doesn't answer the question. Where are the apps that prove out the viability of AMD's approach as a general compute solution?
They're out there too - I've even posted links to them (and scolded some of the crap). Obviously I'm wasting time doing so, because apparently people round here believe that AMD has zero GPGPU penetration.
Jawed