liolio, one big advantage of a fully integrated GPU is that it does not need to pass intermediates around via memory. If the vertex processing immediately feeds the rasterizer and that immediately feeds the pixel processing and that immediately feeds the ROPs, you avoid a ton of storage and bandwidth needs for intermediates. If you split the work along the way, you also have to solve (new) storage and transfer issues.
Simple case: forward render, diffuse+specular.
A traditional GPU stores color and Z per pixel in a single pass and is done.
Split it after rasterization and you suddenly need a place to store at least the interpolated texture coordinates to feed your screen-space shading pass, and you also need to budget bandwidth to write them out and read them back in. The cost goes up further for every additional vertex attribute you need during shading.
In terms of silicon cost, a traditional GPU doesn't get this completely for free either, but the machinery to buffer and pass this data around is all inside the chip. There's no intermediate data going out to memory and back in, only the final results.
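To put rough numbers on that, here's a back-of-the-envelope sketch of the extra traffic a split pipeline could generate by round-tripping interpolated attributes through memory. The resolution, attribute count and frame rate are purely illustrative assumptions, not measurements of any real part:

```python
# Rough estimate of the traffic added by writing interpolated attributes to
# memory and reading them back for a separate shading pass.
# All figures below are illustrative assumptions.

width, height  = 1280, 720     # 720p render target
fps            = 60
attributes     = 4             # e.g. position, normal, 2x texcoords
bytes_per_attr = 16            # 4 x FP32 per attribute

bytes_per_pixel = attributes * bytes_per_attr
frame_bytes     = width * height * bytes_per_pixel

# Written once by the rasterization side, read once by the shading pass.
traffic_gb_s = frame_bytes * 2 * fps / 1e9
print(f"{bytes_per_pixel} B/pixel of intermediates, "
      f"~{traffic_gb_s:.1f} GB/s of extra traffic at {fps} fps")
```

With these assumptions it lands around 7 GB/s just for the intermediates, before any overdraw, which is the bandwidth the integrated pipeline never has to spend.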
Thanks for the insight
I was a bit concerned about it, as you'd have to store quite some data per pixel if you want the vector units to "resume" shading, until I realized the situation would not be worse than for deferred shading: you store a lot of data per pixel in your G-buffer anyway. There would be more to the pool of fixed hardware than rasterizer(s) and texture units: some form of fixed pixel pipeline (or, as I said, some shaders if you want to keep more choices/programmability). So I don't think it's a problem.
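A quick sketch of that comparison; both layouts below are made up for illustration, but they're in the usual ballpark for this kind of renderer:

```python
# Per-pixel storage: attributes parked so the vector units can "resume"
# shading vs. a typical deferred-shading G-buffer.
# Layouts are assumptions, not any particular engine's format.

width, height = 1280, 720
mb = lambda bpp: width * height * bpp / 2**20

# Parked attributes: depth + packed normal + 2x texcoords + tangent (mostly FP16).
resume_bpp  = 4 + 8 + 8 + 8        # 28 bytes per pixel

# Common deferred G-buffer: depth + albedo + normal + specular/roughness + motion.
gbuffer_bpp = 4 + 4 + 8 + 4 + 4    # 24 bytes per pixel

print(f"resume buffer: {resume_bpp} B/px, {mb(resume_bpp):.0f} MB at 720p")
print(f"G-buffer     : {gbuffer_bpp} B/px, {mb(gbuffer_bpp):.0f} MB at 720p")
```

Under those assumptions both come out in the low tens of MB at 720p, so parking attributes for a later pass is roughly the same order of cost as a G-buffer.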
I don't see on-chip communication being more of a problem than it is now; whether the "remains of the GPU" are fixed function or not, there would be buffers.
In a console set-up, freed from DirectX compliance, a manufacturer could pass on ROPs altogether and favor MLAA; whatever its drawbacks, it's quite a saviour in memory consumption and the overall quality is great. Actually, I would see the thing working a lot like Larrabee.
The vector units would process geometry in bins like in Larrabee and then send it to the "pixel pipeline", which creates the G-buffer in RAM; then the vector units would once again act the way Larrabee is intended to, processing tiles that fit within their local memory. Actually, it could save external bandwidth.
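A quick check of whether a screen tile's worth of G-buffer actually fits in a Larrabee-style local store. The 256 KB figure is roughly what was quoted per Larrabee core, and the per-pixel layout is the same assumed one as above:

```python
# Does one screen tile of G-buffer data fit in a core's local memory?
# 256 KB local store and 28 B/pixel (G-buffer + accumulation) are assumptions.

local_store_bytes = 256 * 1024
bytes_per_pixel   = 24 + 4     # assumed G-buffer + color accumulation target

for tile in (32, 64, 128):
    tile_bytes = tile * tile * bytes_per_pixel
    verdict = "fits" if tile_bytes <= local_store_bytes else "does not fit"
    print(f"{tile}x{tile} tile: {tile_bytes / 1024:.0f} KB -> {verdict}")
```

With those numbers a 64x64 tile fits comfortably, so once a tile is resident the shading passes wouldn't need to touch external memory at all, which is where the bandwidth saving would come from.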
I'm concerned about things like vertex texturing: say you handle tessellation with the vector units and then want to use displacement mapping, you have to send data to the pool of fixed-function hardware for texturing. The result would be put in a buffer on chip or in RAM, and the vector units would process it when available. But is that really different from what happens today?
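To make that round trip concrete, here's a toy trace of the data flow being described: vector units tessellate, hand the texture fetches to the fixed-function units, and resume once the sampled heights land in a buffer. Every name and value here is hypothetical; it only illustrates where the data sits at each step, not any real hardware path:

```python
# Toy trace of the tessellation + displacement-mapping round trip.
# All data and functions are hypothetical illustrations of the data flow.

height_map = [[(x + y) % 8 / 8.0 for x in range(8)] for y in range(8)]

def tessellate(level):
    """Vector units: emit (u, v) vertices for one patch."""
    step = 1.0 / level
    return [(i * step, j * step) for j in range(level + 1) for i in range(level + 1)]

def texture_fetch(uvs):
    """Fixed-function texture units: point-sample the height map."""
    size = len(height_map)
    return [height_map[min(int(v * size), size - 1)][min(int(u * size), size - 1)]
            for u, v in uvs]

# 1. Vector units tessellate the patch.
verts = tessellate(level=4)
# 2. The fetch requests go to the texture units; results land in a buffer
#    (on chip or in RAM) until the vector units pick them up.
heights = texture_fetch(verts)
# 3. Vector units resume and apply the displacement (here simply along +z).
displaced = [(u, v, h) for (u, v), h in zip(verts, heights)]
print(len(displaced), "displaced vertices")
```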
I realize that texture units sit next to the SIMD arrays for a reason, so once again there would be hardware on top of the triangle set-up/rasterizer, texture units and command processor.
That's why I'm considering fixed-function hardware: to minimize the cost.
The point is that the number of pixels won't increase much in the near future, even less so in the console realm, so the cost would go down.
Say the cost were as much as 50% of the die now at 40nm; it would already be down to 25-30% in 2 or 3 years at 32nm, when next-generation systems launch.
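Roughly how that scaling works out, assuming pure area scaling with the node and a constant total die size (real shrinks are messier, so this is only a sanity check on the figure above):

```python
# Back-of-the-envelope: the fixed-function block stays functionally the same,
# so its area shrinks with the node while the freed area goes to more vector
# units. Ideal area scaling only; real processes don't shrink this cleanly.

fixed_fraction_40nm = 0.50           # the "as much as 50%" figure above
area_scale = (32 / 40) ** 2          # ideal 40nm -> 32nm area scaling, ~0.64

fixed_fraction_32nm = fixed_fraction_40nm * area_scale
print(f"~{fixed_fraction_32nm:.0%} of the die at 32nm")   # ~32%
```

Pure area scaling already lands close to that range, and the fraction drops further if the starting figure is lower than 50% in the first place.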
EDIT 1
I think I should have described the thing from scratch as "having a VPU and a GPU on the same die and trying as hard as possible to minimize the GPU cost"; it may have been clearer.
EDIT 2
Regarding the 50%, it's just for the sake of the discussion; I think the cost would be way lower.