Saving bandwidth and requiring fewer registers. Oh, and lowering total execution latency. If you want Larrabee to do 10 collision tests for your physics engine, you'd rather have the result in 100 ns than 10 ms. It's OK for a GPU to lag a couple of frames behind for graphics, but that's unacceptable for other real-time applications.
I was taking some of your points very seriously until I read that one. I really don't think you'd find that the number of threads affects latency too badly if you did the math (I know that number was in jest, but still)... Especially if you could put your compute threads on high priority in the scheduler, which you likely could.
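Just to sketch a bit of that math (toy numbers; the throughput and dispatch overhead below are invented round figures, not measurements of any real part):

```cuda
// Back-of-envelope: even if a small compute job queues behind other work,
// the extra latency is measured in microseconds, not milliseconds.
// All constants are assumptions for illustration only.
#include <cstdio>

int main() {
    const double alu_rate   = 500e9; // assumed: ~500 Gops/s class 2008/9 GPU
    const double queued_ops = 1e6;   // assumed: a million ALU ops queued ahead
    const double launch_us  = 20.0;  // assumed: kernel dispatch overhead in us
    double wait_us = queued_ops / alu_rate * 1e6 + launch_us;
    printf("worst-case extra latency ~= %.1f us\n", wait_us); // ~22 us
    return 0;
}
```

Even with pessimistic assumptions you land around tens of microseconds, nowhere near the 10 ms figure.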
More seriously, regarding caching + prefetching for GPUs: I'm sure it would help, especially once you realize how big the memory burst sizes are going to be with GDDR5, so the chance you hit the right memory block is obviously higher. However, some prefetches will always be wasted (otherwise it's not a *speculative* prefetch), and every wasted one has just cost you a bunch of bandwidth. So my point is this: you argue it'd save bandwidth. Personally, I'm pretty damn sure it'd *waste* bandwidth overall.
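To put a rough number on the waste, here's a toy saturated-bus model (the coverage and accuracy values are made up; the point is the trend, not the exact figures):

```cuda
// Toy model: on a saturated bus, every wrongly prefetched burst displaces a
// useful one. 'coverage' = fraction of demand traffic the prefetcher tries
// to anticipate; 'accuracy' = fraction of prefetched bursts actually used.
#include <cstdio>

int main() {
    const float coverage = 0.5f; // assumed: prefetcher targets half the traffic
    for (float accuracy = 1.0f; accuracy >= 0.25f; accuracy -= 0.25f) {
        // Bytes moved per useful byte: uncovered fetches cost 1, covered
        // ones cost 1/accuracy (the wrong guesses still burn full bursts).
        float traffic = (1.0f - coverage) + coverage / accuracy;
        printf("accuracy %.2f -> %.0f%% of peak useful bandwidth\n",
               accuracy, 100.0f / traffic);
    }
    return 0;
}
```

At 50% prefetch accuracy you're already down to two-thirds of your useful bandwidth, on a chip that was bandwidth-bound to begin with.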
Look at it this way: if bigger caches improved perf/mm2 or perf/watt, you'd already see them in current products, because growing a cache is a five-minute change. But you don't. Reuse just doesn't go up that much in graphics workloads. Can prefetching improve the hit rate? Yes, but it won't magically make the data get reused more. On the other hand, when your bandwidth utilisation could be maximised anyway, prefetching to hide latency *will* waste bandwidth. And it also won't hide latency as well or as systematically as registers + threads do. yay?
One positive point for Larrabee I thought I'd mention, though: if you do texture filtering in the shader core, your effective ALU:TEX ratio is higher, which certainly doesn't hurt for hiding average latency... It wouldn't help with a chain of random dependent texture fetches, but errr, I doubt it's supposed to either!
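To make the dependent-chain point concrete, here's a hypothetical CUDA-style kernel (not from any real workload) where each fetch address comes from the previous fetch, so no ALU:TEX ratio can hide the latency:

```cuda
// Pointer-chase: each load depends on the previous one, so the hardware
// can't overlap them; the full memory latency is exposed on every hop.
__global__ void dependent_chain(const int* table, int* out, int start, int hops)
{
    int idx = start + (int)threadIdx.x;
    for (int i = 0; i < hops; ++i) {
        idx = table[idx]; // next address needs this load's result first
    }
    out[threadIdx.x] = idx;
}
```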
(oh and btw, fwiw, current GPUs are already smart enough to issue multiple loads per thread/warp to maximise latency hiding, and I suspect small, frequently-used LUTs will also tend to stay cached as long as your cache thrashing isn't out of this world...)
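For contrast with the dependent chain above, a hypothetical kernel whose loads are independent; this is the kind of pattern where the hardware can keep several fetches in flight per thread and overlap their latencies:

```cuda
// Four loads issued back to back before any of them is consumed: their
// memory latencies overlap instead of serialising.
__global__ void independent_loads(const float* a, float* out, int n)
{
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
    if (base + 3 < n) {
        float x0 = a[base + 0];
        float x1 = a[base + 1];
        float x2 = a[base + 2];
        float x3 = a[base + 3];
        out[base / 4] = x0 + x1 + x2 + x3;
    }
}
```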
Thanks for the info. What kind of registers do GPUs use anyway?
afaik, it's multi-banked 6T SRAM, nominally single-ported (in practice 1 read port + 1 write port, for a variety of reasons, but that's not much more expensive than a single shared port iirc).
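If it helps intuition, here's a toy model of what multi-banking with 1R + 1W ports buys you (the bank count and register-to-bank striping are pure guesses for illustration, not any vendor's actual layout):

```cuda
// With one read port per bank, two same-cycle reads only collide when they
// map to the same bank; striping registers across banks makes that rare.
#include <cstdio>

const int NUM_BANKS = 4; // assumed bank count, purely illustrative

bool reads_conflict(int reg_a, int reg_b) {
    return (reg_a % NUM_BANKS) == (reg_b % NUM_BANKS) && reg_a != reg_b;
}

int main() {
    printf("r0 vs r1 conflict? %d\n", reads_conflict(0, 1)); // different banks: 0
    printf("r0 vs r4 conflict? %d\n", reads_conflict(0, 4)); // same bank: 1
    return 0;
}
```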
Aren't we getting close to the point where adding more registers to a GPU to hide latency wouldn't help, because we're bandwidth limited anyway? Wouldn't it make sense, then, to spend those transistors on caches that reduce both bandwidth usage and average latency?
Personally, I don't think we're going to be horribly bandwidth limited in the 2009 timeframe, given that we could see NV/AMD use 6 GHz effective GDDR5 on 384-bit buses or something crazy like that *if* they needed it (6 Gbps per pin × 384 pins ÷ 8 = 288 GB/s, so yes, nearly 300 GB/s!). Whether they'd actually need that much is not completely obvious, but we'll see how it goes.
EDIT: Added a bunch of stuff.