The arbitrary R/W in the LDS is important too.
You mean write anywhere? That's a feature of R700 LDS too.
That's not to say that R800 LDS doesn't work better than R700 LDS.
I'm not convinced there are any big architectural leaps left to make,
You mean on this side of GPU design, as opposed to the Larrabee-like future?
DWF seems like something which can be handled in software ...
But only for clause lengths > X, where X is many times greater than a hardware implementation would need? Also, is DWF able to stand up to the strain of nested branching?
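To make the trade-off concrete, here is a rough sketch, assuming DWF means dynamic warp formation, i.e. regrouping threads by branch outcome so that each SIMD pass runs a single path. It only counts issue passes for one level of branching. The function names and the 16-thread example are made up for illustration, and the model ignores the cost of the regrouping itself, which is exactly what the clause-length question is about.

```python
from math import ceil

def passes_static(preds, width):
    """Passes needed when warps are fixed: a divergent warp runs both paths."""
    total = 0
    for i in range(0, len(preds), width):
        warp = preds[i:i + width]
        # One pass if any thread takes the branch, one more if any thread does not.
        total += int(any(warp)) + int(not all(warp))
    return total

def passes_regrouped(preds, width):
    """Passes needed if threads are repacked by branch outcome (idealised DWF)."""
    taken = sum(preds)
    not_taken = len(preds) - taken
    return ceil(taken / width) + ceil(not_taken / width)

# Hypothetical worst case: branch outcome alternates thread by thread.
preds = [True, False] * 8   # 16 threads
static_passes = passes_static(preds, width=4)
regrouped_passes = passes_regrouped(preds, width=4)
print(static_passes, regrouped_passes)
```

Under these assumptions regrouping halves the pass count, but a software version pays for the repacking on every branch, so it only wins when the work after the branch is long enough to amortise that cost. Nested branches would repeat the repacking at each level.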
the only important leap left to make IMO is to fold the pixel cache into L2 (making it read/write, with coherency being guaranteed by relatively simple fences ... doesn't give the low latency cross core coherency of Larrabee, but I don't think that's really necessary).
This is one of my big questions about D3D11, as it seems to declare open day for out of order pixel shader memory-accesses.
R800, by the sound of it, has beefed-up buffers as a step in this direction. Additionally the ability of TUs to read render targets sounds like there's a connection of data from RBE cache to L2 (which is for TU), in order to provide a monster pixel data bandwidth into the ALUs. (That's a guess).
But I'd still like to know more about what's happening there.
After that I don't really see how it will be much more difficult to program than, say, Larrabee. If you want to use the option of the LDS, with its comparatively huge gather bandwidth, it will be harder to program ... but it's good to have options.
L2 in Larrabee with 32 cores at 1.5GHz provides about 3TB/s of bandwidth. We're looking at 1TB/s L1/LDS (guessing LDS bandwidth) in RV870 and 435GB/s L2->L1. GT200's shared memory bandwidth is about 1.4TB/s; it would be reasonable to expect roughly double that in GF100.
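For anyone checking the arithmetic, aggregate bandwidth is just units x bytes per clock x clock rate. The sketch below is a back-of-envelope check, not vendor data: the per-unit widths (64 bytes per clock per Larrabee core, four L2 partitions at 128 bytes per clock and an 850MHz clock for RV870) are assumptions picked because they reproduce the totals quoted above.

```python
def aggregate_bandwidth_gb_s(units, bytes_per_clock, clock_ghz):
    """Peak aggregate bandwidth in GB/s: units * bytes per clock * GHz."""
    return units * bytes_per_clock * clock_ghz

# Larrabee L2: 32 cores at 1.5GHz (from the post), 64 B/clk per core (assumed).
larrabee_l2 = aggregate_bandwidth_gb_s(32, 64, 1.5)

# RV870 L2->L1: 4 partitions, 128 B/clk each, 0.85GHz (all three assumed).
rv870_l2_l1 = aggregate_bandwidth_gb_s(4, 128, 0.85)

print(f"Larrabee L2:  {larrabee_l2:.0f} GB/s")
print(f"RV870 L2->L1: {rv870_l2_l1:.1f} GB/s")
```

With those assumed widths the totals land on roughly 3TB/s and 435GB/s, matching the figures in the post.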
I still think shared memory is a short-term fix that'll hobble programming these things later on.
Oh and Ct is getting closer:
http://makebettercode.com/ct_tech/survey.php
even if Intel appears to believe that it's an interim thing.
Jawed