Do you consider GT200 -> Fermi a major architectural overhaul? I would say so.
I agree. Almost nothing carries over.
And I think it will be something in a similar vein. This is AMD's first GPU whose development probably began around the same time as Bulldozer's. I would not be surprised at all if it was redesigned to merge with Bulldozer at the core level, not just at the sit-on-the-same-die level like Llano.
Bulldozer's been a work in progress for yonks, longer than the next GPU.
To merge with Bulldozer it needs to integrate with BD's memory system.
That's fundamentally a cache/MC question. Which, incidentally, is what I think got chopped out of Cypress. AMD retained the R700 memory system for Cypress.
I wouldn't be at all surprised if the next chip has the same ALU, TU and ROP counts as Cypress.
While we are at it, I'll wager that in the BD refresh (if not @ 22nm SOI) they are planning to take a few SIMD engines (somewhere between 1 and 4) and put them into a Bulldozer module.
The SIMDs will have their private L1 data and texture caches (just like BD cores have theirs), and the L2 caches of the CPU and SIMD cores will be unified.
BD needs a GPU on-die, basically. AMD's strategy is OpenCL: CPU and GPU integrated makes for a compute monster. SSE will become irrelevant if you want throughput.
The idea of putting a GPU SIMD engine "in-line" as part of a BD module is problematic - it has a very different concept of registers and instruction streams. To do this would require an entirely new GPU SIMD. I won't say that's unlikely, but I do think it's a long way off since the x,y,z,w,t set in current GPUs is heavily refined.
I suppose it's possible to strip down the x,y,z,w,t set for implementation within a BD module (re-work it for core clocks, remove the optimisations it has for clause-by-clause execution), but then it's hardly different from, or more novel than, what BD already has, except for the t lane (which is actually pretty useful, to be fair).
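For anyone unfamiliar with the x,y,z,w,t arrangement, here's a toy Python sketch of how independent scalar ops get packed into five-slot bundles, with transcendental ops restricted to the t lane. It's a simplified model, not AMD's compiler: the `Op` type, the `pack_bundles` and `pick_lane` helpers, the op names and the greedy in-order packing are all made up for illustration, and it ignores real constraints such as register read-port limits and clause boundaries.

```python
from dataclasses import dataclass

VECTOR_LANES = ("x", "y", "z", "w")   # four general-purpose lanes
TRANS_LANE = "t"                      # fifth lane, also handles transcendentals
TRANSCENDENTAL = {"sin", "cos", "exp", "log", "rcp", "rsqrt"}  # illustrative set


@dataclass
class Op:
    """Hypothetical scalar op: name, destination register, source registers."""
    name: str
    dst: str
    srcs: tuple = ()


def pick_lane(op, current):
    """Return a free lane for op in the current bundle, or None if none fits."""
    if op.name in TRANSCENDENTAL:
        # Assumption of this model: transcendentals may only issue on t.
        return TRANS_LANE if TRANS_LANE not in current else None
    for lane in VECTOR_LANES + (TRANS_LANE,):
        if lane not in current:
            return lane
    return None


def pack_bundles(ops):
    """Greedily pack ops, in program order, into bundles of up to five lanes."""
    bundles = []
    current = {}      # lane name -> Op
    written = set()   # registers written by ops already in the current bundle
    for op in ops:
        # An op cannot share a bundle with an op whose result it reads
        # (or whose destination it overwrites).
        hazard = op.dst in written or any(s in written for s in op.srcs)
        lane = None if hazard else pick_lane(op, current)
        if lane is None:
            bundles.append(current)          # close the bundle, start a new one
            current, written = {}, set()
            lane = pick_lane(op, current)
        current[lane] = op
        written.add(op.dst)
    if current:
        bundles.append(current)
    return bundles


ops = [
    Op("mul", "a", ("p", "q")),
    Op("mul", "b", ("p", "r")),
    Op("mul", "c", ("q", "r")),
    Op("mul", "d", ("q", "q")),
    Op("sin", "e", ("p",)),
    Op("add", "f", ("a", "b")),   # depends on the first two multiplies
]
bundles = pack_bundles(ops)
for i, bundle in enumerate(bundles):
    print(i, {lane: op.name for lane, op in bundle.items()})
```

In this model the four independent multiplies and the sin fill one bundle, while the dependent add has to wait for the next bundle and leaves the other four lanes idle.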
Jawed