Since the accesses are unordered, doesn't that mean ATI can still use its L2 for reads?
It's possible that ATI can use its L2$ (and maybe already does)... I don't recall whether it is already coherent across the chip, but in my brief testing, adding or omitting "globallycoherent" didn't seem to make any difference to performance or functionality, implying that they're only ever using globally coherent caches. The difference here is that NVIDIA has an L1$ that they imply will be used for caching both global and local memory (CUDA lets you play with some of these parameters). This is interesting because the L1$ is the same memory they use for LDS (they partition it), so it should be low latency, and it casts doubt on whether it's worth explicitly loading the LDS, which as I understand it now basically amounts to cache line pinning. ATI on the other hand has a pure LDS for L1, and I imagine the L2$ is a fair distance away in terms of latency. Thus I would imagine that you'd always want to explicitly load into LDS on their cards if you intend to reuse any global memory results, coherent or not.
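To make the "explicit load" concrete, here's a minimal DirectCompute sketch of staging data into LDS by hand (the buffer names, GROUP_SIZE, and the 1D neighborhood reuse are all made up for illustration). On hardware with a real L1$ the re-reads in the second half might be nearly free anyway, which is exactly the question above:

```hlsl
#define GROUP_SIZE 256

StructuredBuffer<float>   g_input;   // global (device) memory
RWStructuredBuffer<float> g_output;

groupshared float lds[GROUP_SIZE];   // the explicitly "pinned" working set

[numthreads(GROUP_SIZE, 1, 1)]
void main(uint3 dtid : SV_DispatchThreadID, uint gi : SV_GroupIndex)
{
    // One explicit load per thread: this is the "cache line pinning".
    lds[gi] = g_input[dtid.x];
    GroupMemoryBarrierWithGroupSync();  // make the loads group-visible

    // Reuse neighbours from LDS instead of re-reading global memory.
    // On Fermi-style hardware the global re-reads might hit L1$ anyway.
    float sum = lds[gi];
    if (gi > 0)              sum += lds[gi - 1];
    if (gi < GROUP_SIZE - 1) sum += lds[gi + 1];
    g_output[dtid.x] = sum;
}
```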
It seems to me that any example that winds up using the r/w cache in the backdoor way you mention wouldn't actually be a compliant DirectCompute program.
Certainly for R/W it's not clear, but for R/O it's definitely possible to write an application that runs well with a cache and not with an explicit LDS, and I expect people will write these without knowing it if they develop on NVIDIA hardware. Your question relates to my earlier aside about the usefulness of globallycoherent at all, though... with the very few guarantees that DC gives you, I'm not sure that you can ever actually make use of this coherence safely.
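For reference, this is roughly what using the annotation looks like in HLSL (g_shared and g_flag are hypothetical buffers). Even with globallycoherent, about the only thing another group can safely observe is the result of an atomic:

```hlsl
globallycoherent RWStructuredBuffer<uint> g_shared;
globallycoherent RWStructuredBuffer<uint> g_flag;

[numthreads(64, 1, 1)]
void main(uint3 dtid : SV_DispatchThreadID)
{
    g_shared[dtid.x] = dtid.x * 2;  // plain (non-atomic) UAV write
    DeviceMemoryBarrier();          // order this thread's device writes...

    uint old;
    InterlockedAdd(g_flag[0], 1, old);  // ...but only the atomic gives
                                        // other groups something that is
                                        // well-defined to observe
}
```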
Or are you thinking of a program with a truckload of memory barriers that Fermi basically ignores due to the coherency of its cache?
I don't think Fermi's caches are necessarily fully coherent on *write*, but I could be mistaken. Granted, the rather relaxed execution model of DC in particular implies that you might still be able to make a compliant implementation without respecting this. Things like CMPEXCH may cause trouble here, but for most operations if the order of execution isn't defined then coherence doesn't mean a whole lot... you can always argue that the discrete "views" that a CS group sees on memory are due to execution ordering even if they are actually due to loose memory coherence in practice (again, excepting some of the atomic-with-return cases).
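To illustrate the CMPEXCH point, here's a toy sketch (g_lock is a hypothetical buffer): with atomics-with-return, the value handed back pins down exactly which other write this thread observed, so a stale view can no longer be waved away as "execution ordering":

```hlsl
RWStructuredBuffer<uint> g_lock;

[numthreads(64, 1, 1)]
void main(uint3 dtid : SV_DispatchThreadID)
{
    // Try to claim slot 0: swap in our ID only if it still holds 0.
    uint original;
    InterlockedCompareExchange(g_lock[0], 0, dtid.x + 1, original);

    if (original == 0)
    {
        // We won the exchange. The returned value tells us precisely
        // what we saw in memory at that point, which is exactly where
        // loose coherence would become visible to a compliant program.
    }
}
```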
BTW, where can I find documentation on DirectCompute? I'm basically relegated to looking at NVIDIA's GPU SDK examples.
It's not well-documented, but MSDN/help file in the latest DXSDK (Feb 2010) has some basics.
As usual, one has to think about caches on GPUs as a way to amplify bandwidth, not to drastically reduce latency (which is not really interesting).
I'd argue that reducing latency is definitely interesting... you don't always have enough parallelism throughout your whole algorithm to keep the entire chip busy, *especially* if you have to cover long memory access latencies too. I think we're going to increasingly see algorithms with tighter feedback loops (iterative optimization stuff especially), and low-latency LDS/caches are going to be critical to these running fast. There's also always a tradeoff between latency and storage... the more latency your memory accesses have, the more cache/storage you spend on thread contexts to hide it. At some point it crosses over in terms of hardware cost.
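As a concrete example of such a feedback loop, consider a plain groupshared reduction (all names hypothetical): every step depends on the previous one, so toward the end there's almost no parallelism left to hide LDS/cache latency with:

```hlsl
#define GROUP_SIZE 256

StructuredBuffer<float>   g_values;
RWStructuredBuffer<float> g_result;

groupshared float partial[GROUP_SIZE];

[numthreads(GROUP_SIZE, 1, 1)]
void main(uint3 dtid : SV_DispatchThreadID, uint gi : SV_GroupIndex)
{
    partial[gi] = g_values[dtid.x];
    GroupMemoryBarrierWithGroupSync();

    // log2(GROUP_SIZE) strictly dependent steps: by the last few there
    // are only a handful of active threads, so LDS latency, not
    // bandwidth, dominates.
    for (uint stride = GROUP_SIZE / 2; stride > 0; stride >>= 1)
    {
        if (gi < stride)
            partial[gi] += partial[gi + stride];
        GroupMemoryBarrierWithGroupSync();
    }

    if (gi == 0)
        g_result[0] = partial[0];
}
```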
This might sound crazy (and it probably is...), but I wonder if in some cases it would be faster on an AMD GPU to build data structures in global memory via atomic ops, even if you don't need atomic ops, since they are supposedly performed on a globally shared and coherent (with respect to other atomic ops) r/w cache.
I'm sure it could be useful in some cases, but these operations still have a high overhead even without return values and even with fairly little contention. Excepting the (ridiculously) fast append/consume path, general-purpose atomics to global memory are still pretty costly. That said, I'm sure there are cases where the alternatives are *more* costly.
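To contrast the two paths, here's a toy stream-compaction sketch (all buffer names hypothetical) doing the same job via the dedicated append path and via a general atomic counter; the latter is the one I'd expect to carry the overhead described above:

```hlsl
AppendStructuredBuffer<uint> g_append;   // dedicated append/consume path
RWStructuredBuffer<uint>     g_compact;  // destination for the manual path
RWStructuredBuffer<uint>     g_counter;  // slot 0 holds the running count

[numthreads(64, 1, 1)]
void main(uint3 dtid : SV_DispatchThreadID)
{
    if ((dtid.x & 1) == 0)  // keep even elements, say
    {
        // Fast path: the hardware-accelerated append.
        g_append.Append(dtid.x);

        // General path: an atomic-with-return against global memory plus
        // a scattered write -- typically noticeably more expensive.
        uint slot;
        InterlockedAdd(g_counter[0], 1, slot);
        g_compact[slot] = dtid.x;
    }
}
```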