Also, is it accurate to say that most of the arguments for Larrabee seem to revolve around the larger cache size and not the programming model per se?
I think these two deserve to be discussed separately: the hardware-managed cache, and the programming model.
If we compare Larrabee and current GPUs (from both ATI and NVIDIA), these are the two major differences. GPUs do not have a hardware-managed cache. Of course, they do have texture caches, but those are read-only. The same goes for the constant cache. On the other hand, Larrabee has a CPU-style cache, and it's coherent, which makes atomic/locking operations much more efficient. This can be very important for some applications.
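To make the atomics point concrete, here is a minimal C++ sketch (not Larrabee-specific code) of the kind of shared-counter pattern that cache coherence makes cheap: on a coherent design the fetch-add can be serviced on the cache line holding the counter, instead of needing a round trip to memory or a dedicated atomic unit.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Shared counter: on a cache-coherent machine an atomic fetch-add is
// typically resolved in the cache line that holds the counter; without
// coherent caches the same operation is far more expensive.
std::atomic<long> counter{0};

long count_with_threads(int n_threads, long per_thread) {
    counter.store(0);
    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t)
        workers.emplace_back([per_thread] {
            for (long i = 0; i < per_thread; ++i)
                counter.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& w : workers) w.join();
    return counter.load();
}
```

The function names here are just for illustration; the point is only that fine-grained shared-state updates like this are the workload class where coherence pays off.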
Another difference is the programming model. Larrabee can support two models: the first is similar to a GPU's, through LRBni, i.e. SIMD with gather/scatter, the so-called "many threads" model. The second is the more traditional SMP style, i.e. just using Larrabee as a multi-core CPU. Of course, to get the most out of Larrabee's power you need to use the first model, but the point is that if your problem is not suitable for the first model, you still have the second. You can't say the same for current GPUs.
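The second, SMP-style model is just ordinary multi-core code. As a hedged sketch (plain C++ threads, nothing Larrabee-specific), it looks like any chunked parallel reduction you would write for a multi-core CPU:

```cpp
#include <algorithm>
#include <numeric>
#include <thread>
#include <vector>

// "SMP style": carve the data into per-core chunks, run one ordinary
// thread per chunk, and combine partial results at the end. This is
// the second model above -- Larrabee used as a multi-core CPU.
double smp_sum(const std::vector<double>& data, int n_threads) {
    std::vector<double> partial(n_threads, 0.0);
    std::vector<std::thread> workers;
    size_t chunk = (data.size() + n_threads - 1) / n_threads;
    for (int t = 0; t < n_threads; ++t)
        workers.emplace_back([&, t] {
            size_t lo = t * chunk;
            size_t hi = std::min(data.size(), lo + chunk);
            for (size_t i = lo; i < hi; ++i) partial[t] += data[i];
        });
    for (auto& w : workers) w.join();
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

Nothing in this code needs SIMD or gather/scatter, which is exactly the fallback being described: problems that don't vectorize can still run this way.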
If we are going for traditional SMP-style multi-threaded programming, then I agree that a cache is the way to go. You really can't expect to hide a meaningful amount of memory latency with so few threads. But since current GPUs can't do this at all, I think the advantage here clearly favors Larrabee, although some may argue that using this model on Larrabee is probably not going to be better than just using a normal CPU.
However, if we go for the "many threads" model, or the vector model, the benefit of a hardware-managed cache is not that clear. The idea behind the vector model (and the old-style vector computers) is that using a long vector allows you to hide memory latency, so you don't need a cache. However, to hide the latency well, you need a relatively regular memory access pattern. For example, even with gather/scatter support, if your vector loads each element from a different memory location, you are not going to get good performance. But even with a cache, I don't see how you could get good performance from that pattern either.
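A scalar model of what a gather instruction does makes the access-pattern issue easy to see. The sketch below (hypothetical names, plain C++ standing in for a SIMD gather) loads one table element per "lane": if the indices are contiguous or clustered, the loads hit a few cache lines or stream well from memory; if they are random, every lane is a distinct memory access, which neither latency hiding nor a cache can rescue.

```cpp
#include <cstddef>
#include <vector>

// Scalar model of a vector gather: out[i] = table[idx[i]] for a whole
// "vector" of lanes. Performance on real hardware depends almost
// entirely on the pattern in idx, not on this loop itself.
std::vector<float> gather(const std::vector<float>& table,
                          const std::vector<size_t>& idx) {
    std::vector<float> out(idx.size());
    for (size_t i = 0; i < idx.size(); ++i)
        out[i] = table[idx[i]];
    return out;
}
```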
Of course, if your data structure happens to fit inside the cache, that would be quite helpful. However, in real-world applications this is rare. There are a few possible situations. It's possible that your data access pattern is very cache-friendly and the data fits into the cache nicely; then the cache wins hands down. Another possibility is that the access pattern is very random and data-dependent, so it's almost impossible to do anything about it; then the cache is not helpful at all. The third possibility is that you need to do some "blockification" to make your data accesses more cache-friendly (generally this has to take the size of the cache into consideration). This is probably the most common situation (for vector-friendly code). In this case, it's almost always possible to "blockify" the data access pattern so that a software-managed scratch pad can handle it well enough.
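A classic instance of this "blockification" is a tiled matrix transpose. The sketch below processes an n x n matrix in B x B tiles so each tile's reads and writes stay within a cache-sized footprint; the same tiling works whether that footprint is a hardware cache or a software-managed scratch pad, with B a tuning parameter chosen from its size.

```cpp
#include <cstddef>
#include <vector>

// Tiled ("blockified") transpose: a naive transpose touches dst in
// large column strides, missing cache on nearly every write. Working
// tile by tile keeps both the source tile and the destination tile in
// cache (or in a scratch pad) while they are being used.
void blocked_transpose(const std::vector<double>& src,
                       std::vector<double>& dst, size_t n, size_t B) {
    for (size_t ii = 0; ii < n; ii += B)
        for (size_t jj = 0; jj < n; jj += B)
            for (size_t i = ii; i < ii + B && i < n; ++i)
                for (size_t j = jj; j < jj + B && j < n; ++j)
                    dst[j * n + i] = src[i * n + j];
}
```

Note that the loop nest itself never asks whether the fast memory is managed by hardware or software; only the choice of B does, which is the sense in which a scratch pad "handles it well enough."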
Of course, it's still possible that future GPUs will converge with Larrabee a bit. For example, a future GPU may have a few scalar processing units with nearly full CPU functionality (and maybe a cache!) to control its vector units.