Jawed (and whoever else might want to comment), what do you think about Theo's comment "The alleged specifications of ATI Evergreen reveal that this chip is not exactly a new architecture, but rather a DirectX 11-specification tweak of the RV770 GPU architecture."
My original theory, going back to the arrival of R600, is that this is an architecture with a very long lifetime - and there are plenty of hints that ATI designed R600 with a lot more capability than D3D10 ended up with, as D3D10 was cut back. At the same time there are hints that 10.1 has capabilities beyond anything planned for 10. So it's all a bit murky.
It seems to me that GS/SO were particularly hard-hit, i.e. amplification was seriously curtailed. But if TS was in the future, what was anyone aiming for in GS/SO anyway?
At the same time I've been wondering whether the fine-grained and more pervasive nature of memory operations in D3D11 requires an overhaul. That's why I was asking whether the pure latency-hiding architecture is enough on its own, or whether more advanced caching is required (with a nod towards the pre-fetching that already exists in regular caching of texels for ordinary texture mapping).
And then we get into routing bottlenecks and crossbar madness.
This get-to-market-first approach seems rather ingenious to me; it doesn't seem like much in DX11 is radically different from what ATI had with DX10.1 (beyond the shared memory stuff). Bump registers and local store up to CS5 level,
LDS needs doubling, but I'm not aware of any effect on registers.
ensure 32-bit (u)int atomics work,
It occurred to me recently that global atomics aren't currently a part of CAL programming (or I haven't found them) - there is atomicity at the LDS/shared-register level (i.e. both at SIMD level), and this atomicity is very much in the sense of any operation, rather than the D3D11-style integer atomics. This is because when a clause is underway on ATI it's uninterruptible, so as long as the entire atomic update is within a single clause it's atomic by default.
So it seems to me that atomics in the D3D11 sense are entirely new, even if D3D11's thread group shared memory atomic operations are trivial (i.e. they're just data-type-restricted versions of what the clusters already achieve).
bump tessellation unit from the 16x to 64x level required by DX11
Seems non-trivial - though I've discovered that the amplification of TS is only on odd factors, i.e. from 15x to 64x is only 24x more vertices. The sheer quantity of data seems to imply that current ATI GPUs (including Xenos) use 15x for practical purposes - specifically because of on-die buffer capacity? Simply because of setup throughput?
64x in D3D11 could be just another bonkers limit, like 4096 vec4s per pixel of registers.
and implement whatever is necessary to run VS+HS directly to global memory with a second pass for TS+DS+PS (GS as per R700). Append/consume is easy to do with atomics.
SO is append, and currently supports 4 streams bound at one time. As far as I can tell vertex fetch is consume. So the atomicity required in coordinating all clusters already exists in ATI. It's just a question of whether it scales.
About the only possible wildcard I can see is DX11 R/W render targets. It seems (from what others have posted) as if DX11 has only 32-bit int and 32-bit unsigned int atomics, so using atomics on individual (8-bit or 16-bit) components of a 32-bit (or 64-bit) render target doesn't seem possible (you'd instead have to do a CAS on the entire pixel and a retry loop on CAS failure; 64-bit would be a mess without 64-bit atomics). Which likely makes unordered R/W to render targets marginal in usefulness (using a full 32-bit value per component IMO is not an option unless working out of cached memory). So it seems as if the memory export ability of R700 would be fine for DX11 R/W RT access...
I don't understand why D3D11 R/W would want to be per-component in an RT.
RTs are always multiples of 32 bits as far as I can tell.
The way I see this is that the developer is on their own - if they want atomicity and a specific ordering on RT R/W they need to roll their own - will it be faster than multi-pass?
If anything, the disaffection for GS/SO may be repeated in much of the new stuff in D3D11 - e.g. if it turns out that bandwidth and latency are much too dominant - at least for the first generation of GPUs.
Jawed