Very interesting indeed.
Though I'm really wondering how they got die size down so much. Reading that, it seems like they didn't cut down on the number of SIMDs, which is sort of hard to believe given that size needed to go down from 400+ to 330 mm².
Size was more like ~480mm² if the diagram is to be believed. Going from ~480 to 330 is a ~30% cut in die size. The cut was probably more significant than that, because we know that the via solution for RV740 was "a doubling of vias", which made RV740 grow.
So what else? Sideport is mentioned, but that's probably only good for 10mm² or so at best. What else could have been in there? Wider internal data paths (though the article says "features" had to go)? Cache sizes (not that they take up a whole lot of space)? In any case it can't have been something which required rebuilding of whole blocks, as that would have led to a much larger delay.
For the sideport to be useful it probably needs to be much meatier than the one seen in RV770, because that sideport's bandwidth is nothing to write home about (it is literally superfluous). 10x more bandwidth?
Also I think a complete revamp of the cache system is due. Evergreen has two sets of atomic units: one set in the ROPs and another set in the LDSs. A cache system with one set of atomics close to the ALUs would do all this, making the atomics run on L1 which is dual-purpose L1/LDS. We get back to the old topic of making such atomics globally coherent, something discussed at length in the GT300 thread, which is a serious problem.
Getting rid of the ROPs also has implications for early-Z.
The "peculiarly asymmetric" handling of caches for RW that Gipsel and I have been discussing might be a side effect of AMD simply deleting the fancy cache stuff. Retreating to a slightly enhanced version of what's in R700? R700 only supports a single UAV, but Evergreen has to support 8 for D3D11 compliance.
There might be some clues in the "reserved" gaps in the opcodes seen in the ISA, e.g. these 12 values:
19 EXPORT_RAT_INST_DEC_UINT: dst = ((dst==0) | (dst > src)) ? src : dst-1.
31:20 Reserved.
32 EXPORT_RAT_INST_NOP_RTN: Internal use by SX only (flush+ack with no opcode). Return dword.
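A quick sketch of how that DEC_UINT line reads, in Python. The assumptions here are mine, not the ISA doc's: the expression is taken as ((dst==0) | (dst > src)) ? src : dst-1, operands are treated as 32-bit unsigned, and `dec_uint` is just a made-up name for illustration.

```python
MASK32 = 0xFFFFFFFF  # assumption: operands are 32-bit unsigned


def dec_uint(dst: int, src: int) -> int:
    """Hypothetical helper: value of dst after one wrapping decrement.

    Reads the quoted expression as ((dst == 0) | (dst > src)) ? src : dst - 1.
    """
    dst &= MASK32
    src &= MASK32
    if dst == 0 or dst > src:
        return src  # wrap back up to the limit held in src
    return dst - 1


# Counting down with a limit of 3: the value cycles 3, 2, 1, 0 and wraps to 3.
value = 3
trace = []
for _ in range(6):
    trace.append(value)
    value = dec_uint(value, 3)
```

So it's a decrement that wraps to src rather than to 0xFFFFFFFF, and an out-of-range dst gets clamped back to src. That is the same rule CUDA documents for atomicDec.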
The RAT ID actually has space for 16 UAVs (coming in D3D11.1?), but the field there is 9 bits wide ([8:4] is unused).
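The counting behind those remarks, as a small Python sketch. `span` is a hypothetical helper, and the only inputs are the numbers quoted above: reserved opcode values 31:20, and a 9-bit RAT ID field with [8:4] unused.

```python
def span(hi: int, lo: int) -> int:
    """Number of entries in an inclusive hi:lo range (hypothetical helper)."""
    return hi - lo + 1


reserved_opcodes = span(31, 20)    # reserved opcode values 20..31
rat_id_bits_total = span(8, 0)     # the full 9-bit field
rat_id_bits_unused = span(8, 4)    # bits [8:4]
rat_id_bits_used = rat_id_bits_total - rat_id_bits_unused  # bits [3:0]

uavs_addressable = 1 << rat_id_bits_used  # IDs encodable in the used bits
bits_for_8_uavs = (8 - 1).bit_length()    # bits needed for D3D11's 8 UAVs
```

So 8 UAVs would fit in 3 bits, the 4 bits in use cover 16, and the other 5 bits of the field are headroom.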
Then you get into the whole topic of whether faster setup is required. And whether that's predicated on enhancements in the cache system.
Also, I'd say they missed the 2x performance target. Granted it's there in theoretical flops but not really in practice, though maybe the 2x target was only in theoretical flops...
It seems to me "shader core" is the basis for that.
Jawed