RV730 is 150mm². RV740 is 136mm². That's 14mm² smaller while packing twice the SPU/TMU logic. Damn impressive, to say the least, but here we're talking about a chip that is nearly 80mm² smaller than RV770. A fair estimate would put Evergreen, I'd say, 10%-30% faster than RV770. Not bad at all for its size. To me this chip screams mainstream and the perfect replacement for the HD 4850/HD 4870/HD 4890.
Yes, this is like R600->RV670 - the ALU/TU/RBE counts were unchanged and clocks got bumped by 4%, the bus got chopped in half and the GDDR3 clock was raised by 36%.
Except that if the bus got chopped in half and the memory clock was raised, Evergreen would have ~45mm² to fill with stuff. That's a hell of a lot of stuff, since I estimate that RV740's clusters are around 52mm².
Or, that's 45mm² of D3D11-specific additions.
Or, that's 45mm² of D3D11 stuff + architectural re-jigging.
It's conceivable that the architecture needs a shake-up to handle the memory-intensive nature of D3D11.
It seems to me that D3D11 makes buffers (resources) of indeterminate count and size a finer-grained component of rendering. Previously, rendering consisted of consuming some fixed-size buffers whilst writing to other fixed-size buffers.
Geometry shading opened a can of worms by making the output buffers variably sized. We've seen that ATI currently uses a pair of ring buffers to handle the ebb-and-flow of GS. Now D3D11 gives developers access to their own arbitrarily sized buffers to be used pretty much whenever they feel like it (PS or CS seem the most likely places, and arguably CS is a distinct rendering pipeline all of its own) - though it seems there is still a hard limit on the number of these buffers that can be bound to the rendering pipeline at any one time.
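For concreteness, an append buffer boils down to something like the following - a minimal C++ sketch assuming the usual "atomic counter plus backing store" model, with names of my own invention, not ATI's actual implementation:

[code]
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// Conceptual model of a D3D11-style append buffer: a fixed backing
// allocation plus an atomically bumped "head" counter.  Every appending
// thread claims a unique slot, so all writers cluster around the head.
struct AppendBuffer {
    std::vector<uint32_t> storage;   // backing memory (fixed capacity)
    std::atomic<uint32_t> head{0};   // next free slot

    explicit AppendBuffer(std::size_t capacity) : storage(capacity) {}

    // Returns false once the buffer's capacity is exhausted.
    bool append(uint32_t value) {
        uint32_t slot = head.fetch_add(1, std::memory_order_relaxed);
        if (slot >= storage.size()) return false;
        storage[slot] = value;
        return true;
    }
};

int main() {
    AppendBuffer buf(1024);   // capacity fixed up front by the application
    buf.append(42);           // e.g. a PS/CS thread emitting one record
}
[/code]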
So it seems to me that there are now multiple sets of paired ring-buffers along the rendering pipeline. TS output is variable, GS output is variable and PS output is now variable. PS output is particularly arduous because:
- there can be multiple independent variably-sized buffers written by a single pixel shader
- the in-flight count of pixels is higher than for any other kind of graphics primitive
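By a ring buffer here I mean nothing more exotic than a producer/consumer queue of the following shape - a generic C++ sketch, not R600's actual implementation, with the pairing and sizing details left out:

[code]
#include <cstddef>
#include <cstdint>
#include <vector>

// Generic ring buffer: a producer stage (e.g. GS) pushes at the head while a
// consumer stage drains from the tail.  Variably-sized output "ebbs and
// flows" inside a fixed footprint; if either end stalls, the other has to
// wait, which is where latency-hiding earns its keep.
class RingBuffer {
public:
    explicit RingBuffer(std::size_t capacity) : data_(capacity) {}

    bool push(uint32_t v) {                        // producer side (head)
        if (count_ == data_.size()) return false;  // full: producer stalls
        data_[head_] = v;
        head_ = (head_ + 1) % data_.size();
        ++count_;
        return true;
    }

    bool pop(uint32_t& v) {                        // consumer side (tail)
        if (count_ == 0) return false;             // empty: consumer stalls
        v = data_[tail_];
        tail_ = (tail_ + 1) % data_.size();
        --count_;
        return true;
    }

private:
    std::vector<uint32_t> data_;
    std::size_t head_ = 0, tail_ = 0, count_ = 0;
};

int main() {
    RingBuffer rb(4096);
    rb.push(7);               // producer appends a primitive's output
    uint32_t v;
    rb.pop(v);                // downstream stage consumes it
}
[/code]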
In R600 etc. the paired ring buffers rely upon latency-hiding to perform. I've not seen any analysis of GS performance that highlights the quality of that latency-hiding - all we have are hints that R600's ridiculous bandwidth was a nod in the direction of making GS work well, and that RV670 should show a significant shortfall due to its much lower bandwidth.
So in D3D11 chips, is latency-hiding against ring buffers held in memory enough? Can a layer of cache provide any benefit here? Theoretically, while appending or consuming a variably sized buffer, caching works well, since all threads are focused on a single region of the buffer. In some ways this is the ideal scenario for caching and it's much easier than caching render back end tasks such as blending, where a stream of pixels arrives with "random" memory addresses (randomness ameliorated by screen-space tiling, though I don't know whether an entire tile can be held on-die in cache).
Can caching for append/consume buffers be re-used for RBE tasks? One of the properties of append/consume is that it doesn't "tile" - because all threads are focused on writing to the head. The head will move "slowly" through tiles in memory space, i.e. it'll move slowly through MC channels. This seems like a useful property to me, as it means that the MCs can be easily configured/scheduled to pre-fetch (while consuming the tail) and it means that sizeable burst writes can be done, e.g. the MC performing a single write after a wodge of data has been added to the head. This doesn't sound too different from RBEs holding entire/portions of a screen space tile - though the timing is skewed in favour of append/consume, where the lifetime of a block is much more coherent (bursty).
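To put some toy numbers on that "slow" movement through MC channels (the channel count and interleave granularity below are purely illustrative, not RV770's real memory map), here's a quick C++ sketch:

[code]
#include <cstdio>

// Toy illustration of why an append/consume head moves slowly across memory
// channels: with addresses interleaved across channels in fixed-size stripes,
// a linearly advancing head spends many consecutive appends on one channel
// before crossing into the next - ideal for large burst writes.
int main() {
    const unsigned kChannels   = 4;    // hypothetical MC channel count
    const unsigned kInterleave = 256;  // hypothetical bytes per channel stripe
    const unsigned kRecordSize = 16;   // one vec4 appended per thread

    unsigned address = 0;              // buffer head, advancing linearly
    for (unsigned i = 0; i < 80; ++i, address += kRecordSize) {
        if (address % kInterleave == 0) {
            unsigned channel = (address / kInterleave) % kChannels;
            // 256/16 = 16 consecutive appends land on the same channel.
            std::printf("appends %3u..%3u -> channel %u\n",
                        i, i + kInterleave / kRecordSize - 1, channel);
        }
    }
}
[/code]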
L2 in R600 is pretty large, hundreds of KB at least (not massive though). It effectively supports pre-fetching of texels by virtue of both locality and the fact that many texel coordinates are known before pixel shading commences. Much the same applies to append/consume, whereas RBE is pretty much stuck with some degree of randomness. So append/consume seems like it blurs across the functionality of texture and RBE caching, with the strong locality of texels, but the requirement to write.
Currently ATI effectively supports 128 vec4 registers per pixel (vertex, thread, etc.) before registers have to spill to memory. Is that the limit? It seems likely to me that ATI can't increase the register file without having to re-time instruction issue/execution. Right now the ALU and TEX pipes seem tightly bound to pairs of wavefronts in-flight, with an 8-cycle pipeline and effectively some multiple of that for register reads/reads-after-writes. So it seems pretty difficult to tweak the register file (e.g. double it) in order to substantially increase latency hiding or to support shaders with substantially higher register allocations.
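A back-of-envelope illustration of that trade-off in C++ - the 256KB per-SIMD register file is my assumption for RV770-class hardware, and the 64-thread wavefront and 16-byte vec4 are the usual ATI figures rather than anything stated above:

[code]
#include <cstdio>

// Register allocation per thread versus wavefronts available for latency
// hiding.  256KB per-SIMD register file is an assumption; 64-thread
// wavefronts and 16-byte vec4 registers are the usual ATI figures.
int main() {
    const unsigned kRegFileBytes  = 256 * 1024;
    const unsigned kWavefrontSize = 64;
    const unsigned kVec4Bytes     = 16;

    for (unsigned regs : {16u, 32u, 64u, 128u}) {
        unsigned perWavefront = regs * kVec4Bytes * kWavefrontSize;
        std::printf("%3u vec4 regs/thread -> %3u KB/wavefront -> %2u wavefronts in flight\n",
                    regs, perWavefront / 1024, kRegFileBytes / perWavefront);
    }
}
[/code]

At 128 registers per thread that works out to 2 wavefronts per SIMD, which lines up with the "pairs of wavefronts in-flight" observation.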
So, maybe register spill needs to become a first class citizen. Register spill appears to be a coarse-grained variety of append/consume. When a wavefront is created its registers are allocated; D3D specifies that 4096 vec4s per pixel are allowed, which is 64KB per pixel. This is, effectively, a contiguous block of data - e.g. 128 registers is 2KB per pixel, so 128KB per wavefront - though it could be made up of smaller blocks. If register spillage is required then blocks of a wavefront's register allocation can be sent to memory. With round-robin scheduling, and with the scheduler able to see the progress of a wavefront's antecedents (e.g. texture filtering), it can schedule the fetching-back of those register blocks that were dumped into memory. All of this is bursty and looks amenable to simple stream-through caching.
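The sizes behind that, in C++ (the 16-register spill block is just an example granularity I picked, not a known hardware figure):

[code]
#include <cstdio>

// The arithmetic behind the spill argument: vec4 = 16 bytes, 64-thread
// wavefronts.  The 16-register spill block is an illustrative granularity.
int main() {
    const unsigned kVec4Bytes     = 16;
    const unsigned kWavefrontSize = 64;

    unsigned maxPerPixel = 4096 * kVec4Bytes;             // D3D limit: 64KB per pixel
    unsigned typPerPixel = 128  * kVec4Bytes;             // 128 regs: 2KB per pixel
    unsigned typPerWave  = typPerPixel * kWavefrontSize;  // 128KB per wavefront

    unsigned blockRegs  = 16;                                      // example spill block
    unsigned blockBytes = blockRegs * kVec4Bytes * kWavefrontSize; // 16KB burst

    std::printf("D3D maximum: %u KB per pixel\n", maxPerPixel / 1024);
    std::printf("128 regs:    %u KB per pixel, %u KB per wavefront\n",
                typPerPixel / 1024, typPerWave / 1024);
    std::printf("Spilling %u regs of a wavefront moves %u KB in one burst\n",
                blockRegs, blockBytes / 1024);
}
[/code]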
So can a single cache take on all these roles? Or are dedicated caches better? Can the high quantities of append/consume traffic opened up by the really quite long and twisty D3D11 rendering pipeline be supported purely by uncached latency-hiding? Is ATI's current latency-hiding at its limit? Can register spill performance penalties be ameliorated by schedulable stream-through caching?
I'm not trying to suggest that all of the "missing 45mm²" is cache. I'm just wondering if the significant increase in the density of memory operations requires a massive re-wiring of all the major memory clients, perhaps with a new, higher level of overview in scheduling and perhaps also a new level of generality.
Jawed