Isn't that the whole point of VLIW?
Yes, it's doing a good job, isn't it?
I'm interested to see how ATI fares with heavy LDS workloads, though. Much as on NVidia, avoiding bank conflicts is key. I can't tell which architecture is going to suffer more from stalls. I think NVidia has an advantage because accesses are treated like gathers and become the operand collector's problem, rather than making the ALUs stall immediately. NVidia, I think, only stalls once the hardware-thread population is exhausted (i.e. once there are no other threads left to hide latency with).
NVidia's too-small register file may be an issue though, since it means less latency-hiding is available to cover stalls.
A good way to avoid stalls on ATI is to increase vectorisation per work item, i.e. to increase instruction-level parallelism. But the register file still constrains that, ultimately.
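The ILP idea, sketched in plain C rather than real kernel code: instead of one dependency chain per work item, keep several independent accumulators so the VLIW slots (or the scoreboard, on NVidia) have work to issue while earlier results are in flight. The 4-way split here is just an illustrative choice:

```c
/* ILP sketch: one dependency chain vs. four independent chains. */
#include <stddef.h>

/* One accumulator: every add depends on the previous add. */
float sum_scalar(const float *x, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Four independent accumulators: 4-way ILP that a VLIW compiler can
   pack into fewer bundles. The trade-off is 4x the live values, which
   is exactly the register-file pressure that ultimately limits this. */
float sum_ilp4(const float *x, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++)   /* tail */
        s0 += x[i];
    return (s0 + s1) + (s2 + s3);
}
```

Same answer either way, but the second version trades registers for issue slots, which is the tension with the register file I'm pointing at.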
I think it's fair to say that local memory is going to be key to the performance of a lot of GPGPU algorithms, though I still think its future is like that of Cell's SPE LS. But there'll be an interim where the only competition, between ATI and NVidia, is purely local-memory focused. NVidia has already signalled at least a partial step away with the dual-function local-memory/L1 cache. Arguably the cache hierarchy in Fermi changes the game: some problems will like cache more than local memory.
But then I also suspect NVidia has a better handle on the classical vector-computer techniques, scan, scatter and gather, which is why I think the 3rd-generation GPGPU techniques that come out of people programming with Fermi will steal a march on OpenCL 1.0 techniques.
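For anyone not steeped in the vector-computer literature, "scan" here means prefix sum, the building block behind stream compaction, sorting and so on. A sequential C sketch of the exclusive variant (the parallel tree-based versions are what the GPUs actually run, but the contract is the same):

```c
/* Exclusive prefix sum: out[i] = sum of in[0..i-1], with out[0] = 0. */
#include <stddef.h>

void exclusive_scan(const int *in, int *out, size_t n)
{
    int running = 0;
    for (size_t i = 0; i < n; i++) {
        out[i] = running;   /* everything strictly before element i */
        running += in[i];
    }
}
```

Scatter and gather are then just indexed writes (`out[idx[i]] = x[i]`) and indexed reads (`x[i] = in[idx[i]]`), and scan supplies the indices.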
Where is OpenCL 2.0?
http://sa09.idav.ucdavis.edu/docs/SA09-OpenCLOverview.pdf
Page 8 says 1.1 is coming within 6 months and 2.0 is due in 2012, i.e. 2 years away.
Hopefully 1.1 catches up with D3D11-CS. It seems to me that Fermi/CUDA 3 has about an 18-month lead on OpenCL 2.0.
Jawed