I would also like to know; I'm by no means an expert on the matter, an armchair one at best. My original comment came from my realization of how "huge" the texture units looked in R770.
As you saw, I was most likely misled by some die shot (cf. 3Dilettante's post). In the end it's a bit unclear how big the texture units are versus the ALUs. The other part of the problem is how much more efficient texture units are in perf/mm² versus even reworked ALU arrays. I have no answer; I was just trying to start a discussion on the matter.
My feeling is that GPUs will lose some of their fixed-function units in the coming years.
It looks like GPUs will have to become more flexible and more "autonomous" to satisfy the needs of developers from various fields, and by autonomous I mean in a CPU-like way. Devs want readable and writable caches; on the other hand, that's not the model that scales best, and the hardware cost can turn prohibitive. My feeling (really a feeling more than anything else) is that removing fixed-function hardware would make chips easier to design from the hardware vendor's POV and would ease the development of the matching software model/layer.
Another of my "feelings" is in regard to Intel's choices. I've read a lot of comments about their choice of x86 as Larrabee's ISA, but really few arguments about the real architectural choice Larrabee stands on: many, many cores augmented with potent SIMD units.
I can guess/see the merit of this choice, but I'm not sure that a simplistic CPU core augmented with a wide SIMD unit is the sweet spot, the proper building block for the GPU 2.0.
Such a core is tiny, so you need many of them to achieve the throughput the industry is aiming for, which implies quite a lot of communication, whether for memory coherency or simply to move data around.
You end up with relatively few "FLOPS" to amortize each CPU front end (even if one could do better than x86).
It's a really interesting debate, and I would absolutely love to see better-educated people discuss the issue.
For example, what would be the pros/cons of, say, a properly done Larrabee clone versus something looking a bit like a mix of Cell and a GPU?
What I mean by a mix of a GPU and Cell is an idea I got after reading a link about "COMICS" that Mfa posted some time ago. If I've got it right, COMICS is a software layer that provides software cache coherency for the SPUs. In that setup, I find that the PPU acts a bit like the command processor in a GPU does. So I wondered whether a general-purpose GPU, or GPU 2.0, could be something like a blend of these ideas.
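Just to make the COMICS idea concrete (as I understand it, which may well be wrong), here is a really rough C sketch of a software-managed cache sitting in a local store. All the names are mine, memcpy stands in for the DMA you would issue on real hardware, and it's read-only and direct-mapped, just to show where the cost of doing coherency in software sits:

```c
/* Rough sketch of a software-managed cache over a small local store, in the
 * spirit of what I understand COMICS does for the SPUs. Hypothetical names;
 * memcpy stands in for the DMA (mfc_get) you would issue on real hardware. */
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 128                        /* bytes per cache line         */
#define NUM_LINES 64                         /* lines resident in the LS     */

static uint64_t tag[NUM_LINES];              /* which global line is cached  */
static uint8_t  valid[NUM_LINES];
static uint8_t  data[NUM_LINES][LINE_SIZE];  /* the cached lines themselves  */

extern uint8_t global_memory[];              /* the "effective address" space */

/* Read one byte through the software cache. */
uint8_t sw_cache_read(uint64_t addr)
{
    uint64_t line   = addr / LINE_SIZE;
    uint32_t slot   = line % NUM_LINES;      /* direct-mapped placement      */
    uint32_t offset = addr % LINE_SIZE;

    if (!valid[slot] || tag[slot] != line) {
        /* Miss: fetch the line from main memory (a DMA on a real SPU). */
        memcpy(data[slot], &global_memory[line * LINE_SIZE], LINE_SIZE);
        tag[slot]   = line;
        valid[slot] = 1;
    }
    return data[slot][offset];
}
```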
Say you have some SIMD arrays attached to a specialized CPU. Each SIMD array would be more capable than they are now (a bit like an SPU, which still can't do everything a standard CPU does). The specialized CPU would be completely hidden from the developer: it would do the job a command processor does in a GPU, and could do what the PPU does in the aforementioned setup, providing various software coherency models for memory (at the programmer's command).
I feel such a core would have to be tinier than a whole current GPU, so the specialized CPU would also handle load balancing and memory coherency between its "brothers".
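To make the "hidden CP-like CPU" idea a bit more concrete, here is a purely hypothetical sketch (none of this is a real API): the developer enqueues work with the coherency model it wants, and the hidden CPU does the load balancing and would set the memory model up for the chosen array:

```c
/* Hypothetical sketch only: what the hidden "CP-like" CPU might consume.
 * The point is the division of labour: the developer asks for a coherency
 * model per task, the hidden CPU picks a SIMD array and configures it. */
#include <stdint.h>

typedef enum {
    COHERENCY_NONE,          /* pure local-store, streaming work              */
    COHERENCY_SW_CACHE,      /* software cache, as in the COMICS-style sketch */
    COHERENCY_TRANSACTIONAL  /* software transactions (see further down)      */
} coherency_model_t;

typedef struct {
    void (*kernel)(void *args);  /* what to run on the SIMD array             */
    void *args;
    coherency_model_t model;     /* you only pay for what you ask for         */
} task_desc_t;

#define NUM_SIMD_ARRAYS 8

/* Trivial round-robin load balancing, done by the hidden CPU, not the dev. */
int dispatch(const task_desc_t *task)
{
    static int next_array = 0;
    int target = next_array;
    next_array = (next_array + 1) % NUM_SIMD_ARRAYS;

    /* Here the hidden CPU would configure the chosen array's memory model
     * according to task->model, then hand it the kernel, much like a command
     * processor feeds the shader arrays today. */
    (void)task;
    return target;
}
```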
Say you could have such a core made of (a rough data-structure sketch follows the list):
one specialized CPU with its own resources, possibly with a hardware cache (hidden from the developer)
8(*) 16-wide SIMD arrays, each with a dedicated LS as in the SPUs (* random number; maybe I just love power-of-two numbers like KK)
a bigger LS shared among the SIMD arrays
a router
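Put as a data structure, purely for illustration (the sizes are made-up numbers, not based on any real chip):

```c
/* Purely illustrative description of one such core; all sizes are the
 * random numbers from the list above, nothing based on a real design. */
#include <stdint.h>

#define NUM_SIMD_ARRAYS   8
#define SIMD_WIDTH        16
#define PRIVATE_LS_BYTES  (64 * 1024)   /* per-array local store (illustrative) */
#define SHARED_LS_BYTES   (512 * 1024)  /* bigger LS shared by the arrays       */

typedef struct {
    float lanes[SIMD_WIDTH];                  /* one 16-wide vector register    */
} simd_reg_t;

typedef struct {
    simd_reg_t regs[128];                     /* register file (illustrative)   */
    uint8_t    local_store[PRIVATE_LS_BYTES]; /* private, SPU-style LS          */
    /* ... sequencer, DMA engine, etc. */
} simd_array_t;

typedef struct {
    /* the specialized CPU: hidden from the developer, with its own resources
     * and possibly a hardware cache */
    /* cp_state_t cp; */
    simd_array_t arrays[NUM_SIMD_ARRAYS];
    uint8_t      shared_ls[SHARED_LS_BYTES];  /* shared among the SIMD arrays   */
    /* router_t  router; */                   /* hooks the core onto the grid   */
} core_t;
```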
You put some of these cores on a grid (à la Intel's SCC or Tilera's products), but it would be less of a burden than putting X times as many tinier cores à la Larrabee (simplistic CPU + SIMD) on a chip of the same size to achieve the same throughput.
In the end, developers would have a lot of choice, anywhere from seeing each individual SIMD array to handling the chip as a present-day GPU.
You would pay for memory coherency (whether in die space, power consumption, etc.) only for the tasks that require it. As Sweeney suggested in a presentation, you could, if you need it, run a coherent transactional model in software on some of the SIMD arrays.
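To illustrate the flavour of "running a coherent model in software only when you ask for it", here is a toy single-word optimistic update with retry, using C11 atomics. A real software transactional model would track read/write sets across many words; this is not what Sweeney describes, just the simplest possible version of the idea:

```c
/* Toy sketch: optimistic update with retry on one shared word. Not a real
 * STM, just the smallest example of "commit only if nobody raced us". */
#include <stdatomic.h>
#include <stdint.h>

/* Attempt *var = f(*var) as a mini-transaction; retry on conflict. */
void transactional_update(_Atomic int64_t *var, int64_t (*f)(int64_t))
{
    int64_t expected = atomic_load(var);
    for (;;) {
        int64_t desired = f(expected);              /* the "transaction" body */
        if (atomic_compare_exchange_weak(var, &expected, desired))
            return;                                 /* commit succeeded        */
        /* CAS failed: someone else committed first; 'expected' now holds the
         * fresh value, so re-run the body (optimistic retry). */
    }
}

static int64_t add_one(int64_t x) { return x + 1; }
/* usage: transactional_update(&shared_counter, add_one); */
```

The point being that the SIMD arrays which never call something like this never pay for it.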
Such a chip might be easier to design and offer more performance both per watt and per mm², but it would put a lot of pressure on the software layer/runtime that would have to make its use practical.
What do you members think? Is that a crazy idea? A brain fart?