Does it matter? The fact is that for efficient utilization you still need lots of threads with customized thread scheduling. And if you speculatively fetch too many texels based on a guess from the first fetch, you get poor bandwidth usage.
Yes it does. Just look at your code when you do some SIMD optimization: you won't get performance just by switching to SIMD, because you add a lot of latency. E.g. transforming vertices by Mad(V,Mx, Mad(V,My, Mad(V,Mz, ...))) won't give you much performance (especially on in-order cores). You need to hide the latency, e.g. by processing 4 or more vertices in parallel (it's not just loop unrolling, it's also statically scheduling the instructions to hide latency).
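To make that concrete, here is a rough sketch (not from the post; the `Mat4x4`, `xform1` and `xform4` names and layout are made up) of the dependent mul-add chain versus a four-vertex interleaved version, written with plain SSE intrinsics:

```c
#include <xmmintrin.h>   /* SSE intrinsics */

/* Columns of a 4x4 matrix, one SSE register each (column-major). */
typedef struct { __m128 c0, c1, c2, c3; } Mat4x4;

/* Naive version: one vertex at a time.  The three mul-adds form a single
 * dependency chain, so an in-order core stalls for the full FP latency
 * after every instruction. */
static __m128 xform1(const Mat4x4 *m, __m128 v)
{
    __m128 x = _mm_shuffle_ps(v, v, _MM_SHUFFLE(0,0,0,0));
    __m128 y = _mm_shuffle_ps(v, v, _MM_SHUFFLE(1,1,1,1));
    __m128 z = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2,2,2,2));
    __m128 r = m->c3;                          /* assume w == 1      */
    r = _mm_add_ps(_mm_mul_ps(z, m->c2), r);   /* depends on r       */
    r = _mm_add_ps(_mm_mul_ps(y, m->c1), r);   /* depends on r       */
    r = _mm_add_ps(_mm_mul_ps(x, m->c0), r);   /* depends on r       */
    return r;
}

/* Interleaved version: four vertices in flight.  The four chains are
 * independent, so their mul-adds can be scheduled back to back and the
 * FP latency is hidden instead of stalled on. */
static void xform4(const Mat4x4 *m, const __m128 in[4], __m128 out[4])
{
    __m128 r0 = m->c3, r1 = m->c3, r2 = m->c3, r3 = m->c3;
    /* z step for all four vertices before touching any result again */
    r0 = _mm_add_ps(_mm_mul_ps(_mm_shuffle_ps(in[0], in[0], _MM_SHUFFLE(2,2,2,2)), m->c2), r0);
    r1 = _mm_add_ps(_mm_mul_ps(_mm_shuffle_ps(in[1], in[1], _MM_SHUFFLE(2,2,2,2)), m->c2), r1);
    r2 = _mm_add_ps(_mm_mul_ps(_mm_shuffle_ps(in[2], in[2], _MM_SHUFFLE(2,2,2,2)), m->c2), r2);
    r3 = _mm_add_ps(_mm_mul_ps(_mm_shuffle_ps(in[3], in[3], _MM_SHUFFLE(2,2,2,2)), m->c2), r3);
    /* y step */
    r0 = _mm_add_ps(_mm_mul_ps(_mm_shuffle_ps(in[0], in[0], _MM_SHUFFLE(1,1,1,1)), m->c1), r0);
    r1 = _mm_add_ps(_mm_mul_ps(_mm_shuffle_ps(in[1], in[1], _MM_SHUFFLE(1,1,1,1)), m->c1), r1);
    r2 = _mm_add_ps(_mm_mul_ps(_mm_shuffle_ps(in[2], in[2], _MM_SHUFFLE(1,1,1,1)), m->c1), r2);
    r3 = _mm_add_ps(_mm_mul_ps(_mm_shuffle_ps(in[3], in[3], _MM_SHUFFLE(1,1,1,1)), m->c1), r3);
    /* x step */
    r0 = _mm_add_ps(_mm_mul_ps(_mm_shuffle_ps(in[0], in[0], _MM_SHUFFLE(0,0,0,0)), m->c0), r0);
    r1 = _mm_add_ps(_mm_mul_ps(_mm_shuffle_ps(in[1], in[1], _MM_SHUFFLE(0,0,0,0)), m->c0), r1);
    r2 = _mm_add_ps(_mm_mul_ps(_mm_shuffle_ps(in[2], in[2], _MM_SHUFFLE(0,0,0,0)), m->c0), r2);
    r3 = _mm_add_ps(_mm_mul_ps(_mm_shuffle_ps(in[3], in[3], _MM_SHUFFLE(0,0,0,0)), m->c0), r3);
    out[0] = r0; out[1] = r1; out[2] = r2; out[3] = r3;
}
```

Both versions do the same arithmetic; the only difference is instruction ordering, which is exactly the point about static scheduling on in-order cores.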
This is not realistic. Texture filtering is low precision (at full speed) and fits in a very specific data path so it has a fraction of the cost of a shader unit. Having it parallel also allows variable fetch cost (e.g. aniso, wide formats, etc) to be hidden by shader instructions.
I thought the R600 always filters in high quality?
It's a business decision because they can't achieve the high efficiency of ATI and NVidia. S3, XGI, Trident etc. failed to break through in recent years because of this; you could definitely see dependent texturing killing their performance.
Texturing is just a small part of the equation, and it's a very well-known part. As I said, even onboard GPUs have enough of them besides all the other transistors; even Sony and Nintendo could add them to their mobile devices.
If you say it's hard to hide texture-fetch latency, I'd agree with you. But if you say it's hard to build a TMU, I disagree.
It'll never merge completely due to fundamental differences in the workloads. If your units get too much larger trying to accommodate non-GPU loads, then there's an opportunity for the competition to crush you in perf/mm2 by going back to basics. That's what happened in the G7x vs. R5xx generation: NVidia had record profits and margins, whereas ATI barely broke even. RV530 vs. G73 was particularly devastating.
Your units won't get large if you manage to hide latencies. Look at Sun's Niagara: they aren't using as good a process as Intel or IBM, but they manage to get 8 cores on a die with 64 threads. Those 64 threads are not a feature like they claim, they're essential to keeping those 8 cores small, because transistor count doesn't scale linearly with speed: for twice the radix-division performance you need 3 to 4 times more transistors (correct me if I'm wrong). So, conversely, accepting twice the latency _can_ cut the transistor cost to 1/4, allowing you to run 4 cores instead, but you need to hide the latency somehow.
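A rough back-of-the-envelope of that trade-off (the exponent is an assumption read off the "2x speed needs 3-4x transistors" figure above, not a measured number):

\[
\text{transistors} \propto \text{speed}^{\alpha}, \qquad \alpha \approx 1.6\text{--}2
\]
\[
\text{speed} \rightarrow \tfrac{1}{2}\,\text{speed} \;\Rightarrow\; \text{transistors} \rightarrow 2^{-\alpha}\,\text{transistors} \approx \tfrac{1}{3}\ldots\tfrac{1}{4}
\]

So roughly four half-speed (double-latency) cores fit in the area of one full-speed core and give about 2x the aggregate throughput, provided there are enough threads to keep all four busy.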
that's how "the basics" of GPUs work, it's not magic about highly optimized float units, highly optimized scheduling, highly optimized caches, it's just about a reaaaaly long pipeline with very coherent memory access pattern.
and that's why switching states/textures/etc is so expensive. cause you flush that pipe and if you do it frequently, it can't be even filled with enough load to hide the latency. (if you look at the recent NV gpu, they added mostly stuff to hide latency, like double the Temporary register count...)
That's also the big problem with GPGPU and why they'll move towards CPUs: you're very restricted in what you can do if you want performance. Running a CUDA raytracer that randomly reads from memory to traverse a BSP will completely kill your performance, ending up far below what simple CPUs can render.
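A hypothetical sketch of the access pattern meant here (the `BspNode` layout and `find_leaf` name are made up, and this is CPU-side C rather than an actual CUDA kernel): every step is a data-dependent branch followed by a scattered read, so on a GPU the threads of a warp diverge to different nodes and the loads can't be coalesced, which is what defeats the latency-hiding machinery.

```c
typedef struct BspNode {
    float plane[4];              /* splitting plane: ax + by + cz + d */
    int   front, back;           /* child indices; -1 marks a leaf    */
} BspNode;

/* Walk one point down the BSP tree to its leaf.  The next node index
 * depends on the data just loaded, so consecutive iterations (and
 * neighbouring rays) hit essentially random addresses in memory. */
static int find_leaf(const BspNode *nodes, int root, const float p[3])
{
    int i = root;
    while (nodes[i].front != -1) {               /* interior node      */
        const BspNode *n = &nodes[i];
        float d = n->plane[0]*p[0] + n->plane[1]*p[1]
                + n->plane[2]*p[2] + n->plane[3];
        i = (d >= 0.0f) ? n->front : n->back;    /* data-dependent branch */
    }
    return i;                                    /* leaf index         */
}
```

On a CPU this is merely cache-unfriendly; on a GPU it breaks the coherent-access assumption the whole pipeline is built around.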
And I think those unified processors will appear first on consoles, simply because they have fewer backward-compatibility restrictions, like we saw with triple-core SMT on the X360.